CEUR Workshop Proceedings Vol-2848, paper user2agent-paper-3. PDF: https://ceur-ws.org/Vol-2848/user2agent-paper-3.pdf. DBLP: https://dblp.org/rec/conf/iui/XieSP20
     A Multi-Turn Emotionally Engaging Dialog Model
                                                                 Yubo Xie
                                                        Ekaterina Svikhnushina
                                                                 Pearl Pu
                                                             yubo.xie@epfl.ch
                                                      ekaterina.svikhnushina@epfl.ch
                                                             pearl.pu@epfl.ch
                                                 École Polytechnique Fédérale de Lausanne
                                                           Lausanne, Switzerland
ABSTRACT
Open-domain dialog systems (also known as chatbots) have increasingly drawn attention in natural language processing. Some of the recent work aims at incorporating affect information into sequence-to-sequence neural dialog modeling, making the responses emotionally richer, while other work uses hand-crafted rules to determine the desired emotional response. However, these approaches do not explicitly learn the subtle emotional interactions captured in human dialogs. In this paper, we propose a multi-turn dialog system aimed at learning and generating emotional responses in the way that so far only humans know how to do. Compared with two baseline models, offline experiments show that our method performs best in perplexity scores. Further human evaluations confirm that our chatbot can keep track of the conversation context and generate emotionally more appropriate responses while performing equally well on grammar.

CCS CONCEPTS
• Human-centered computing → Human computer interaction (HCI); Natural language interfaces.

KEYWORDS
chatbots, affective computing, deep learning, natural language processing

ACM Reference Format:
Yubo Xie, Ekaterina Svikhnushina, and Pearl Pu. 2020. A Multi-Turn Emotionally Engaging Dialog Model. In IUI ’20 Workshops, March 17, 2020, Cagliari, Italy. ACM, New York, NY, USA, 12 pages. https://doi.org/10.1145/nnnnnnn.nnnnnnn

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 INTRODUCTION
Many application areas show significant benefits from integrating affect information in natural language dialogs. In earlier work on human-computer interaction, Klein et al. [16] found that a user's frustration caused by a computer system can be alleviated by computer-initiated emotional support, by providing feedback on emotional content along with sympathy and empathy. Recently, Hu et al. [14] developed a customer support neural chatbot capable of generating dialogs similar to human ones in terms of empathic and passionate tones, potentially serving as a proxy for customer support agents on social media platforms. In a qualitative study [47], participants expressed an interest in chatbots capable of serving as an attentive listener and providing motivational support, thus fulfilling users' emotional needs. Several participants even noted that a chatbot is ideal for sensitive content that is too embarrassing to ask another human about. Finally, Bickmore and Picard [3] showed that a relational agent with deliberate social-emotional skills was respected more, liked more, and trusted more, even after four weeks of interaction, compared to an equivalent task-oriented agent.

Recent developments in neural language modeling have generated significant excitement in the open-domain dialog generation community. The success of sequence-to-sequence (seq2seq) learning [5, 37] in the field of neural machine translation has inspired researchers to apply the recurrent neural network (RNN) encoder-decoder structure to response generation [42]. Following the standard seq2seq structure, various improvements have been made on the neural conversation model. For example, Shang et al. [34] applied the attention mechanism [2] to the same structure on Twitter-style microblogging data. Li et al. [17] found that the original version tends to favor short and dull responses, and fixed this problem by increasing the diversity of the responses. Li et al. [18] modeled the personalities of the speakers, and Xing et al. [44] developed a topic-aware dialog system. We refer to work in this area globally as neural dialog generation. For a comprehensive survey, please refer to [4].
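The RNN encoder-decoder pipeline underlying these models can be sketched as follows. This is a minimal, untrained illustration with a vanilla tanh RNN, a toy vocabulary, and random weights (the models discussed in this paper use GRUs and attention on top of this skeleton), so the generated tokens carry no meaning:

```python
import numpy as np

rng = np.random.default_rng(0)

VOCAB = ["<pad>", "<sos>", "<eos>", "hi", "how", "are", "you", "fine"]
V, E, H = len(VOCAB), 8, 16  # vocab, embedding, and hidden sizes (toy values)

# Randomly initialized toy parameters: an illustration of the mechanics,
# not a trained model.
emb = rng.normal(0, 0.1, (V, E))
W_xh = rng.normal(0, 0.1, (H, E))
W_hh = rng.normal(0, 0.1, (H, H))
W_out = rng.normal(0, 0.1, (V, H))

def rnn_step(x_id, h):
    """One vanilla-RNN step (the models in this paper use GRUs instead)."""
    return np.tanh(W_xh @ emb[x_id] + W_hh @ h)

def encode(token_ids):
    """Compress the input message into a fixed-size vector."""
    h = np.zeros(H)
    for t in token_ids:
        h = rnn_step(t, h)
    return h

def decode_greedy(h, max_len=5):
    """Generate a response one token at a time, feeding back the argmax."""
    out, tok = [], VOCAB.index("<sos>")
    for _ in range(max_len):
        h = rnn_step(tok, h)
        tok = int(np.argmax(W_out @ h))  # greedy choice of the next word
        if VOCAB[tok] == "<eos>":
            break
        out.append(tok)
    return out

msg = [VOCAB.index(w) for w in ["hi", "how", "are", "you"]]
reply = decode_greedy(encode(msg))
print([VOCAB[t] for t in reply])
```

In practice the parameters are trained by maximizing the likelihood of observed responses, and beam search often replaces the greedy loop.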


More recently, researchers have started incorporating affect information into neural dialog models. While a central theme seems to be making the responses emotionally richer, existing approaches mainly follow two directions. In one, an emotion label is explicitly required as input so that the machine can generate sentences of that particular emotion label or type [49]. In the other, the main idea is to develop handcrafted rules that direct the machine to generate responses with the desired emotions [1, 48]. Both approaches require an emotion label as input (either given or handcrafted), which might be impractical in real dialog scenarios.
   Furthermore, to the best of our knowledge, the psychology and social science literature does not provide clear rules for emotional interaction. It seems such social and emotional intelligence is captured in our conversations. This is why we decided to take an automatic, data-driven approach. In this paper, we describe an end-to-end Multi-turn Emotionally Engaging Dialog model (MEED), capable of recognizing emotions and generating emotionally appropriate and human-like responses, with the ultimate goal of reproducing the social behaviors that are habitual in human-human conversations. We chose the multi-turn setting because a model suitable for single-turn dialogs cannot effectively track earlier context in multi-turn dialogs, either semantically or emotionally. Since the ability to track several turns is essential, we made this design decision from the beginning, in contrast to most related work, where models are only trained and tested on single-turn dialogs. While using a hierarchical mechanism to track the conversation history in multi-turn dialogs is not new (e.g., HRAN by Xing et al. [45]), combining it with an additional emotion RNN that processes the emotional information in each history utterance has never been attempted before.
   Our contributions are threefold. (1) We describe in detail a novel emotion-tracking dialog generation model that learns the emotional interactions directly from the data. This approach is free of human-defined heuristic rules and is hence more robust and fundamental than those described in existing work. (2) We compare our model, MEED, with the generic seq2seq model and a hierarchical model for multi-turn dialogs (HRAN). Offline experiments show that our model outperforms both seq2seq and HRAN by a significant margin. Further experiments with human evaluation show that our model produces emotionally more appropriate responses than both baselines, while also improving language fluency. (3) We illustrate a human-evaluation procedure for judging machine-produced emotional dialogs. We consider factors such as the balance of positive and negative emotions in the test dialogs, a well-chosen range of topics, and dialogs that our human evaluators can relate to. It is the first time such a procedure has been designed with consideration for the human judges. Our main goal is to increase the objectivity of the results and to reduce judges' mistakes due to out-of-context dialogs they have to evaluate.

2 RELATED WORK
Neural Dialog Generation
Vinyals and Le [42] were among the first to model dialog generation using neural networks. Their seq2seq framework was trained on an IT Helpdesk Troubleshooting dataset and the OpenSubtitles dataset [21]. Shang et al. [34] further trained the seq2seq model with an attention mechanism on a self-crawled dataset from Weibo (a popular Twitter-like social media website in China). Meanwhile, Xu et al. [46] built a customer service chatbot by training the seq2seq model on a dataset of conversations between customers and the customer service accounts of 62 brands on Twitter.
   The standard seq2seq framework is applied to single-turn response generation. In multi-turn settings, where a context with multiple history utterances is given, the same structure often ignores the hierarchical characteristic of the context. Some recent work addresses this problem by adopting a hierarchical recurrent encoder-decoder (HRED) structure [32, 33, 35]. To give attention to different parts of the context while generating responses, Xing et al. [45] proposed the hierarchical recurrent attention network (HRAN), using a hierarchical attention mechanism. However, these multi-turn dialog models do not take into account the turn-by-turn emotional changes of the dialog.

Neural Dialog Models with Affect Information
Recent work on incorporating affect information into natural language processing tasks has inspired our current work. These efforts can be mainly described as affect language models and emotional dialog systems.
   Ghosh et al. [11] made the first attempt to augment the original LSTM language model with affect treatment in what they called Affect-LM. At training time, Affect-LM can be considered an energy-based model in which the added energy term captures the degree of correlation between the next word and the affect information of the preceding text. At generation time, the affect information is also used to improve the selection of the next word. A key component of Affect-LM is the use of a well-established text analysis program, LIWC (Linguistic Inquiry and Word Count) [28]. For example, for the sentence "I unfortunately did not pass my exam", the model generates five emotion features denoting (sad: 1, angry: 1, anxiety: 1, negative emotion: 1, positive emotion: 0). This makes Affect-LM both capable of distinguishing the affect information conveyed by each word in the language modeling part and aware of the preceding text's emotion at each generation step. In a similar vein, Asghar et al. [1] appended the original word embeddings with a VAD affect model [43]. VAD is a vector model, as opposed to a categorical model such as LIWC, representing a given emotion on the valence, arousal, and dominance axes.
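As a rough illustration of the kind of categorical affect features LIWC produces, the sketch below derives binary category indicators from a tiny hand-made lexicon. The lexicon and its word-category assignments are invented for this example (the real LIWC dictionary is far larger and proprietary), so the exact feature values differ from what a real LIWC run would produce:

```python
# Toy stand-in for the LIWC dictionary: each word maps to the affect
# categories it belongs to. These assignments are illustrative only.
TOY_LEXICON = {
    "unfortunately": {"negative", "sad"},
    "worried": {"negative", "anxiety"},
    "angry": {"negative", "angry"},
    "happy": {"positive"},
}
CATEGORIES = ["positive", "negative", "anxiety", "angry", "sad"]

def affect_features(sentence):
    """Binary vector: 1 if any word in the sentence hits the category."""
    hits = set()
    for word in sentence.lower().split():
        hits |= TOY_LEXICON.get(word, set())
    return [int(c in hits) for c in CATEGORIES]

print(affect_features("I unfortunately did not pass my exam"))
# With this toy lexicon: positive=0, negative=1, anxiety=0, angry=0, sad=1
```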


In contrast to Affect-LM, Asghar's neural affect dialog model aims at generating explicit responses given a particular utterance. To do so, the authors designed three affect-related loss functions, namely minimizing affective dissonance, maximizing affective dissonance, and maximizing affective content. The paper also proposed affectively diverse beam search during decoding, so that the generated candidate responses are as affectively diverse as possible. However, the literature in affective science does not necessarily validate such rules. In fact, the best strategy for speaking to an angry customer is the de-escalation strategy (using neutral words to validate anger) rather than employing equally emotional words (minimizing affect dissonance) or words that convey happiness (maximizing affect dissonance).
   The Emotional Chatting Machine (ECM) [49] takes a post and generates a response in a predefined emotion category. The main idea is to use an internal memory module to capture the emotion dynamics during decoding, and an external memory module to model emotional expressions explicitly by assigning different probability values to emotional words as opposed to regular words. Zhou and Wang [50] extended the standard seq2seq model to a conditional variational autoencoder combined with policy gradient techniques. The model takes a post and an emoji as input, and generates the response with the target emotion specified by the emoji. Hu et al. [14] built a tone-aware chatbot for customer care on social media by deploying extra meta information about the conversations in the seq2seq model. Specifically, a tone indicator is added to each step of the decoder during the training phase.
   In parallel to these developments, Zhong et al. [48] proposed an affect-rich dialog model using a biased attention mechanism on emotional words in the input message, taking advantage of the VAD embeddings. The model is trained with a weighted cross-entropy loss function, which encourages the generation of emotional words.

Summary
As much as the work in the above section inspired our own, our approach to generating affect dialogs is significantly different. While most related work focused on integrating affect information into the transduction vector space using either VAD or LIWC, we aim at modeling and generating the affect exchanges in human dialogs using a dedicated embedding layer. The approach is also completely data-driven, and thus free of hand-crafted rules. To avoid learning the obscene and callous exchanges often found in social media data such as tweets and Reddit threads [29], we opted to train our model on movie subtitles, whose dialogs were carefully created by professional writers. We believe the quality of this dataset can be better than that of datasets curated via crowdsourcing platforms. For modeling the affect information, we chose to use LIWC because it is a well-established emotion lexical resource covering the whole English dictionary, whereas VAD only contains 13K lemmatized terms.

3 MODEL
We describe our model one element at a time, from the basic structure, to the hierarchical component, and finally the emotion embedding layer.
   We first consider the problem of generating a response y given a context X consisting of multiple previous utterances, by estimating the probability distribution p(y | X) from a data set D = \{(X^{(i)}, y^{(i)})\}_{i=1}^{N} containing N context-response pairs. Here

    X^{(i)} = \big( x_1^{(i)}, x_2^{(i)}, \dots, x_{m_i}^{(i)} \big)    (1)

is a sequence of m_i utterances, and

    x_j^{(i)} = \big( x_{j,1}^{(i)}, x_{j,2}^{(i)}, \dots, x_{j,n_{ij}}^{(i)} \big)    (2)

is a sequence of n_{ij} words. Similarly,

    y^{(i)} = \big( y_1^{(i)}, y_2^{(i)}, \dots, y_{T_i}^{(i)} \big)    (3)

is the response, with T_i words.
   Usually the probability distribution p(y | X) can be modeled by an RNN language model conditioned on X. When generating the word y_t at time step t, the context X is encoded into a fixed-size dialog context vector c_t by following the hierarchical attention structure in HRAN [45]. Additionally, we extract the emotion information from the utterances in X by leveraging an external text analysis program, and use an RNN to encode it into an emotion context vector e, which is combined with c_t to produce the distribution. The overall architecture of the model is depicted in Figure 1. We elaborate below on how c_t and e are obtained, and how they are combined in the decoding part.

Hierarchical Attention
The hierarchical attention structure involves two encoders to produce the dialog context vector c_t, namely the word-level encoder and the utterance-level encoder. The word-level encoder is essentially a bidirectional RNN with gated recurrent units (GRU) [5]. For utterance x_j in X (j = 1, 2, \dots, m), the bidirectional encoder produces two hidden states at each word position k, the forward hidden state h_{jk}^{f} and the backward hidden state h_{jk}^{b}. The final hidden state h_{jk} is then obtained by concatenating the two,

    h_{jk} = \mathrm{concat}\big( h_{jk}^{f}, h_{jk}^{b} \big).    (4)

The utterance-level encoder is a unidirectional RNN with GRU that goes from the last utterance in the context to the first, with its input at each step being the summary of the corresponding utterance, which is obtained by applying a Bahdanau-style attention mechanism [2] on the word-level encoder output.
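The bidirectional word-level encoding of Eq. (4) can be sketched as follows. This is a minimal numpy illustration with randomly initialized (untrained) GRU parameters and toy dimensions, not the actual trained encoder:

```python
import numpy as np

rng = np.random.default_rng(1)
E, H = 6, 4  # toy embedding and hidden sizes

def make_gru(rng):
    """Random GRU parameters: update gate z, reset gate r, candidate h~."""
    return {k: (rng.normal(0, 0.1, (H, E)), rng.normal(0, 0.1, (H, H)))
            for k in ("z", "r", "h")}

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(params, x, h):
    Wz, Uz = params["z"]
    Wr, Ur = params["r"]
    Wh, Uh = params["h"]
    z = sigmoid(Wz @ x + Uz @ h)
    r = sigmoid(Wr @ x + Ur @ h)
    h_cand = np.tanh(Wh @ x + Uh @ (r * h))
    return (1 - z) * h + z * h_cand

def bigru_encode(utterance):
    """Run one GRU forward and one backward over the words, then
    concatenate the two states at each position k (Eq. 4)."""
    fwd_p, bwd_p = make_gru(rng), make_gru(rng)
    n = len(utterance)
    h_f = [np.zeros(H)]
    for x in utterance:                      # left to right
        h_f.append(gru_step(fwd_p, x, h_f[-1]))
    h_b = [np.zeros(H)]
    for x in reversed(utterance):            # right to left
        h_b.append(gru_step(bwd_p, x, h_b[-1]))
    h_b = list(reversed(h_b[1:]))            # align backward states with positions
    return [np.concatenate([h_f[k + 1], h_b[k]]) for k in range(n)]

words = [rng.normal(size=E) for _ in range(5)]  # stand-in word embeddings
states = bigru_encode(words)
print(len(states), states[0].shape)  # 5 word positions, each of size 2H
```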


Figure 1: The overall architecture of our model. (The diagram shows the word-level encoder, the utterance-level encoder with word- and utterance-level attention, the emotion encoder fed by the LIWC-based emotion embedding layer, and the decoder.)
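All attention modules in Figure 1 follow the same additive (Bahdanau-style) pattern: score each encoder state against the previous decoder state, normalize the scores with a softmax, and return the weighted sum of the states. A minimal sketch of this shared pattern, with toy dimensions and randomly initialized parameters standing in for the attention parameters such as v_a, U_a, W_a:

```python
import numpy as np

rng = np.random.default_rng(2)
H, A = 4, 3  # hidden size and attention size (toy values)

# Toy attention parameters, analogous to v_a, U_a, W_a in the paper.
v = rng.normal(0, 0.1, A)
U = rng.normal(0, 0.1, (A, H))
W = rng.normal(0, 0.1, (A, H))

def additive_attention(query, states):
    """Score each state against the query, softmax, then weighted sum."""
    scores = np.array([v @ np.tanh(U @ query + W @ h) for h in states])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                 # softmax over positions
    summary = sum(w * h for w, h in zip(weights, states))
    return summary, weights

s_prev = rng.normal(size=H)                  # previous decoder state
hs = [rng.normal(size=H) for _ in range(6)]  # encoder hidden states
summary, weights = additive_attention(s_prev, hs)
assert np.isclose(weights.sum(), 1.0)        # a proper distribution
print(summary.shape)  # (4,): same size as one hidden state
```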


More specifically, at decoding step t, the summary of utterance x_j is a linear combination of the h_{jk}, for k = 1, 2, \dots, n_j:

    r_j^t = \sum_{k=1}^{n_j} \alpha_{jk}^t h_{jk}.    (5)

Here \alpha_{jk}^t is the word-level attention score placed on h_{jk}, and can be calculated as

    a_{jk}^t = v_a^\top \tanh\big( U_a s_{t-1} + V_a \ell_{j+1}^t + W_a h_{jk} \big),    (6)

    \alpha_{jk}^t = \frac{\exp(a_{jk}^t)}{\sum_{k'=1}^{n_j} \exp(a_{jk'}^t)},    (7)

where s_{t-1} is the previous hidden state of the decoder, \ell_{j+1}^t is the previous hidden state of the utterance-level encoder, and v_a, U_a, V_a and W_a are the word-level attention parameters. The final dialog context vector c_t is then obtained as another linear combination of the outputs of the utterance-level encoder \ell_j^t, for j = 1, 2, \dots, m:

    c_t = \sum_{j=1}^{m} \beta_j^t \ell_j^t.    (8)

Here \beta_j^t is the utterance-level attention score placed on \ell_j^t, and can be calculated as

    b_j^t = v_b^\top \tanh\big( U_b s_{t-1} + W_b \ell_j^t \big),    (9)

    \beta_j^t = \frac{\exp(b_j^t)}{\sum_{j'=1}^{m} \exp(b_{j'}^t)},    (10)

where s_{t-1} is the previous hidden state of the decoder, and v_b, U_b and W_b are the utterance-level attention parameters.

Emotion Encoder
The main objective of the emotion embedding layer is to recognize the affect information in the given utterances so that the model can respond with emotionally appropriate replies. To achieve this, we need an encoder that distinguishes the affect information in the context, in addition to its semantic meaning. Equally, we need a decoder capable of selecting the best and most human-like answers.
   We are able to achieve this goal, i.e., capturing the emotion information carried in the context X, in the encoder, thanks to LIWC. We make use of the five emotion-related categories, namely positive emotion, negative emotion, anxious, angry, and sad. This set can be expanded to include more categories if we desire a richer distinction; see the discussion section for more details on how to do this. Using the newest version of


the program LIWC2015, we are able to map each utterance x_j in the context to a six-dimensional indicator vector 1(x_j), with the first five entries corresponding to the five emotion categories and the last one corresponding to neutral. If any word in x_j belongs to one of the five categories, then the corresponding entry in 1(x_j) is set to 1; otherwise, x_j is treated as neutral, with the last entry of 1(x_j) set to 1. For example, assuming x_j = “he is worried about me”, then

    1(x_j) = [0, 1, 1, 0, 0, 0],    (11)

since the word “worried” is assigned to both negative emotion and anxious. We apply a dense layer with a sigmoid activation function on top of 1(x_j) to embed the emotion indicator vector into a continuous space,

    a_j = \sigma\big( W_e 1(x_j) + b_e \big),    (12)

where W_e and b_e are trainable parameters. The emotion flow of the context X is then modeled by a unidirectional RNN with GRU going from the first utterance in the context to the last, with its input being a_j at each step. The final emotion context vector e is obtained as the last hidden state of this emotion-encoding RNN.

Decoding
The probability distribution p(y | X) can be written as

    p(y \mid X) = p(y_1, y_2, \dots, y_T \mid X)
                = p(y_1 \mid c_1, e) \prod_{t=2}^{T} p(y_t \mid y_1, \dots, y_{t-1}, c_t, e).    (13)

We model the probability distribution using an RNN language model along with the emotion context vector e. Specif-

Table 1: Statistics of the two datasets.

                                   Cornell    DailyDialog
    # dialogs                       83,097         13,118
    # utterances                   304,713        102,977
    Average # turns                    3.7            7.9
    Average # words / utterance       12.5           14.6
    Training set size              142,450         46,797
    Validation set size             10,240         10,240

We use the cross-entropy loss as our objective function:

    \mathcal{L} = -\frac{1}{\sum_{i=1}^{N} T_i} \sum_{i=1}^{N} \log p\big( y^{(i)} \mid X^{(i)} \big).    (18)

4 EVALUATION
We trained our model on two different datasets and compared its performance with HRAN as well as the basic seq2seq model by performing both offline and online testing.

Datasets
We used two different dialog corpora to train our model: the Cornell Movie Dialogs Corpus [6] and the DailyDialog dataset [20].

• Cornell Movie Dialogs Corpus. The dataset contains 83,097 dialogs (220,579 conversational exchanges) extracted from raw movie scripts. In total there are 304,713 utterances.
• DailyDialog. The dataset is developed by crawling raw data from websites used by language learners to learn English dialogs in daily life. It contains 13,118
ically, at time step t, the hidden state of the decoder st is
                                                                                            dialogs in total.
obtained by applying the GRU function,
                                                                                     We summarize some of the basic information regarding the
                st = GRU(st −1, concat(c t , wyt −1 )),                       (14)   two datasets in Table 1.
                                                                                        In our experiments, the models were first trained on the
where wyt −1 is the word embedding of yt −1 . Similar to Affect-
                                                                                     Cornell Movie Dialogs Corpus, and then fine-tuned on the
LM [11], we then define a new feature vector ot by concate-
                                                                                     DailyDialog dataset. We adopted this training pattern be-
nating st (which we refer to as the language context vector)
                                                                                     cause the Cornell dataset is bigger but noisier, while DailyDi-
with the emotion context vector e,
                                                                                     alog is smaller but more daily-based. To create a training set
                         ot = concat(st , e),                                 (15)   and a validation set for each of the two datasets, we took seg-
                                                                                     ments of each dialog with number of turns no more than six,2
on which we apply a softmax layer to obtain a probability
                                                                                     to serve as the training/validation examples. Specifically, for
distribution over the vocabulary,
                                                                                     each dialog D = (x 1, x 2, . . . , x M ), we created M − 1 context-
                     p t = softmax(W ot + b),                                 (16)   response pairs, namely Ui = (x si , . . . , x i ) and yi = x i+1 , for
                                                                                     i = 1, 2, . . . , M − 1, where si = max(1, i − 4). We filtered
where W and b are trainable parameters. Each term in Equa-                           out those pairs that have at least one utterance with length
tion (13) is then given by                                                           greater than 30. We also reduced the frequency of those pairs
                p(yt | y1, . . . , yt −1, c t , e) = p t ,yt .                (17)   2 We chose the maximum number of turns to be six because we would like

                                                                                     to have a longer context for each dialog while at the same time keeping the
1 https://liwc.wpengine.com/                                                         training procedure computationally efficient.
IUI ’20 Workshops, March 17, 2020, Cagliari, Italy                                                                                    Xie, et al.


whose responses appear too many times (the threshold is set to 10 for Cornell, and 5 for DailyDialog), to prevent them from dominating the learning procedure. See Table 1 for the sizes of the training and validation sets. The test set consists of 100 dialogs with four turns. We give a more detailed description of how we created the test set in the section on human evaluation.

Baselines and Implementation

Our choice of including S2S is rather obvious. Including HRAN instead of other neural dialog models with affect information was not an easy decision. As mentioned in the related work, Asghar's affective dialog model, the affect-rich conversation model, and the Emotional Chatting Machine do not learn the emotional exchanges in the dialogs. This leaves us wondering whether using a multi-turn neural model can be as effective in learning emotional exchanges as MEED. In addition, comparing S2S and HRAN also gives us an idea of how much the hierarchical mechanism improves upon the basic model. This is why our final comparison is based on three multi-turn dialog generation models: the standard seq2seq model (denoted as S2S), HRAN, and our proposed model, MEED. In order to adapt S2S to the multi-turn setting, we concatenate all the history utterances in the context into one.

For all the models, the vocabulary consists of the 20,000 most frequent words in the Cornell and DailyDialog datasets, plus three extra tokens:  for words that do not exist in the vocabulary,  indicating the beginning of an utterance, and  indicating the end of an utterance. Here we summarize the configurations and parameters of our experiments:

    • We set the word embedding size to 256. We initialized the word embeddings in the models with word2vec [26] vectors first trained on Cornell and then fine-tuned on DailyDialog, consistent with the training procedure of the models.
    • We set the number of hidden units of each RNN to 256, the word-level attention depth to 256, and the utterance-level attention depth to 128. The output size of the emotion embedding layer is 256.
    • We optimized the objective function using the Adam optimizer [15] with an initial learning rate of 0.001.
    • For prediction, we used beam search [39] with a beam width of 256.

We have made the source code publicly available.3

Evaluation Metrics

The evaluation of chatbots remains an open problem in the field. Recent work [22] has shown that the automatic evaluation metrics borrowed from machine translation, such as the BLEU score [27], tend to align poorly with human judgement. Therefore, in this paper, we mainly adopt human evaluation, along with perplexity and BLEU score, following the existing work.

Automatic Evaluation. Perplexity is a measurement of how well a probability model predicts a sample. It is a popular method used in language modeling. In the neural dialog generation community, many researchers have adopted this method, especially in the beginning of this field [32, 42, 45, 48–50]. It measures how well a dialog model predicts the target response. Given a target response y = {y_1, y_2, ..., y_T}, the perplexity is calculated as

    ppl(y) = p(y_1, y_2, ..., y_T)^(−1/T)
           = exp[ −(1/T) Σ_{t=1}^{T} log p(y_t | y_1, ..., y_{t-1}) ].    (19)

Thus a lower perplexity score indicates that the model is better at predicting the target sentence, i.e., the human's response. Some researchers [19, 34, 48] argue that the perplexity score is not the ideal measurement because, for a given context history, one should allow many responses. This is especially true if we want our conversational agents to speak more diversely. However, for our purpose, which is to speak emotionally appropriately and as human-like as possible, we believe it is a good measure. We do recognize that it is not the only way to measure chatbots' performance. This is why we also conducted a human evaluation experiment.

BLEU score is often used to measure the quality of machine-translated text. Some earlier work on dialog response generation [17, 18] adopted this metric to measure the performance of chatbots. However, a recent study [22] suggests that it does not align well with human evaluation. Nevertheless, we still include BLEU scores in this paper, to get a sense of comparison with the perplexity and human evaluation results.

Human Evaluation. Human evaluation has been widely used to evaluate open-domain dialog generation tasks. This approach can include any criterion we judge appropriate. Most commonly, researchers have included the model's ability to generate grammatically correct, contextually coherent, and emotionally appropriate responses, of which the latter two properties cannot be reliably evaluated using automatic metrics. Recent work [1, 48, 49] on affect-rich conversational chatbots turned to human opinion to evaluate both fluency and emotionality of their models. But such human experiments are sensitive to risk factors if the experiment is not carefully designed. These include whether the instructions are clear, whether they have been tested with users beforehand, and whether there is a good balance of the human judgement tasks. Further, if a test set for human evaluation is prepared

3 https://github.com/yuboxie/meed
by randomly sampling the dialogs from the dataset, it may include out-of-context dialogs, causing confusion and ambiguity for human evaluators. An unbalanced emotional distribution of the test dialogs may also lead to biased conclusions, since the chatbot's abilities are evaluated on an unrepresentative sample.

To take into account the above issues, we took several iterations to prepare the instructions and the test set before conducting the human evaluation experiment. Part of our test set comes from the DailyDialog dataset, which consists of meaningful complete dialogs. To compensate for the imbalance, we further curated more negative emotion dialogs so that the final set has an equal emotion distribution. We provide the details about the test data preparation process and the evaluation experiment below.

Preparation of Natural Dialog Test Set. We first selected the emotionally colored dialogs with exactly four turns from the DailyDialog dataset. In the dataset each dialog turn is annotated with a corresponding emotional category, including the neutral one. For our purposes we kept only those dialogs in which more than half of the utterances have non-neutral emotional labels, resulting in 78 emotionally positive dialogs and 14 emotionally negative dialogs. We recruited two human workers to augment the data to produce more emotionally negative dialogs. Both of them were PhD students from our university (males, aged 24 and 25), fluent in English, and not related to the authors' lab. We found them via email and messaging platforms, and offered 80 CHF (or roughly US $80) gift coupons as an incentive for each participant. The workers fulfilled the tasks in a Google form4 following the instructions and created five negative dialogs with four turns, as if they were interacting with another human, in each of the following topics: relationships, entertainment, service, work and study, and everyday situations. The Google form was released on 31 January 2019, and the workers finished their tasks by 4 February 2019. Subsequently, to form the final test set, we randomly selected 50 emotionally positive and 50 emotionally negative dialogs from the two pools of dialogs described above.

Human Evaluation Experiment Design. In the final human evaluation of the model, we recruited four more PhD students from our university (1 female and 3 males, aged 22–25). Three of them are fluent English speakers and one is a native speaker. The recruitment proceeded in the same manner as described above; the raters were offered 80 CHF (or roughly US $80) gift coupons per participant for fulfilling the task, and an extra 20 CHF (or roughly US $20) coupon was promised as a bonus to the rater judged to be the most serious. For the evaluation survey, we also leveraged a Google form. Specifically, we randomly shuffled the 100 dialogs in the test set, then used the first three utterances of each dialog as the input to the three models being compared (S2S, HRAN, and MEED), and obtained the respective responses. The dialog contexts and the three models' responses were included in the Google form. According to the context given, the raters were instructed to evaluate the quality of the responses based on three criteria:

  (1) Grammatical correctness—whether or not the response is fluent and free of grammatical mistakes;
  (2) Contextual coherence—whether or not the response is sensitive to the previous dialog history;
  (3) Emotional appropriateness—whether or not the response conveys the right emotion and feels as if it had been produced by a human.

For each criterion, the raters gave scores of 0, 1 or 2, where 0 means bad, 2 means good, and 1 indicates neutral. For this survey, the Google form was launched on 12 February 2019, and all the submissions from our raters were collected by 14 February 2019.

Results and Analysis

In this subsection, we present the experimental results of the automatic evaluation metrics as well as human judgement, followed by some analysis.

Automatic Evaluation Results. Table 2 gives the perplexity and BLEU scores obtained by the three models on the two validation sets and the test set. As shown in the table, MEED achieves the lowest perplexity and the highest BLEU score on all three sets. We conducted a t-test on the perplexity values, and the results show significant improvements of MEED over S2S and HRAN on the two validation sets (with p-value < 0.05).

Human Evaluation Results. Tables 3, 4 and 5 summarize the human evaluation results on the responses' grammatical correctness, contextual coherence, and emotional appropriateness, respectively. In the tables, we give the percentage of votes each model received for the three scores, the average score obtained, and the agreement score among the raters. Note that we report Fleiss' κ [10] for contextual coherence and emotional appropriateness, and Finn's r [9] for grammatical correctness. We did not use Fleiss' κ for grammatical correctness because the agreement there is extremely high, which can make Fleiss' κ very sensitive to prevalence [13]. Conversely, we did not use Finn's r for contextual coherence and emotional appropriateness because it is only reasonable when the observed variance is significantly less than the chance variance [40], which did not apply to these two criteria. As shown in the tables, we got high agreement among the raters for grammatical correctness, and fair

4 We provide the link to the form used for creating the dialogs: https://forms.gle/rPagMZYuYJ3M3Sq8A, hoping to help other researchers reproduce the same procedure. However, due to privacy concerns, we do not plan to release this dataset.
Table 2: Perplexity and average BLEU scores achieved by the models. Avg. BLEU: average of BLEU-1, -2, -3, and -4. Validation
set 1 comes from the Cornell dataset, and validation set 2 comes from the DailyDialog dataset.


                                               Perplexity                                         Avg. BLEU
                          Validation Set 1       Validation Set 2   Test Set   Validation Set 1    Validation Set 2   Test Set
              S2S                43.136               25.418        19.913           1.639              2.427          3.720
              HRAN               46.225               26.338        20.355           1.701              2.368          2.390
              MEED               41.862               24.341        19.795           1.829              2.635          4.281
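The perplexity column in Table 2 is computed as in Equation (19). As a minimal illustration of that formula (not the authors' code; the per-token probabilities below are made-up stand-ins for model outputs):

```python
import math

def perplexity(token_probs):
    """Eq. (19): ppl(y) = exp(-(1/T) * sum_t log p(y_t | y_1, ..., y_{t-1}))."""
    T = len(token_probs)
    return math.exp(-sum(math.log(p) for p in token_probs) / T)

# A 4-token response with every token predicted at probability 1/4
# has perplexity 4; more confident predictions drive the score down.
ppl_uniform = perplexity([0.25, 0.25, 0.25, 0.25])   # ≈ 4.0
ppl_confident = perplexity([0.9, 0.8, 0.9, 0.85])    # ≈ 1.16
```

A lower value means the model assigned higher probability to the reference response, which is the sense in which MEED's scores in Table 2 are better.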

Table 3: Human evaluation results on grammatical correctness.

              +2     +1      0    Avg. Score      r
    S2S      98.0    0.8    1.2     1.968       0.915
    HRAN     98.5    1.3    0.2     1.982       0.967
    MEED     99.5    0.3    0.2     1.992       0.981

Table 4: Human evaluation results on contextual coherence.

              +2     +1      0    Avg. Score      κ
    S2S      25.8   19.7   54.5     0.713       0.389
    HRAN     37.3   21.2   41.5     0.958       0.327
    MEED     38.5   22.0   39.5     0.990       0.356

Table 5: Human evaluation results on emotional appropriateness.

              +2     +1      0    Avg. Score      κ
    S2S      21.8   25.2   53.0     0.688       0.361
    HRAN     30.5   28.5   41.0     0.895       0.387
    MEED     32.0   27.8   40.2     0.917       0.337

agreement among the raters for contextual coherence and emotional appropriateness.5 For grammatical correctness, all three models achieved high scores, which means all models are capable of generating fluent utterances that make sense. For contextual coherence and emotional appropriateness, MEED achieved higher average scores than S2S and HRAN, which means MEED keeps better track of the context and can generate responses that are emotionally more appropriate and natural. We first conducted a Friedman test [12] and then t-tests on the human evaluation results (contextual coherence and emotional appropriateness), showing that the improvements of MEED over S2S are significant (with p-value < 0.01).

The comparison between the perplexity scores and the human evaluation results further confirms that, in the context of dialog response generation, perplexity does not align with human judgement. In Table 2, on all three sets HRAN performs worse than S2S in terms of perplexity. However, for all three criteria in the human evaluation, HRAN actually outperforms S2S. Based on this, we conclude that perplexity alone is not enough for evaluating a dialog system.

Visualization of Output Layer Weights. We may wonder how HRAN and MEED differ in terms of the distributional representations of their respective vocabularies (words in the language model, and affect words). We decided to visualize the output layer weights as word embedding representations using a dimensionality reduction technique for the various models.

In the decoding phase, Equation (16) takes o_t, the concatenation of the language context vector s_t and the emotion context vector e, and generates a probability distribution over the vocabulary words by applying a softmax layer. The weight matrix of this softmax layer is denoted as W, whose shape is |V| × 2d, where |V| is the vocabulary size and d = 256 is the hidden state size of the RNNs. Thus the ith row of the weight matrix, W_i, can be regarded as a vector representation of the ith word in the vocabulary. Since we concatenate the language context vector and the emotion context vector as the input to the softmax layer, the first half of the weight vector W_i corresponds to the language context vector, and the second half corresponds to the emotion context vector. We refer to them as the language model weights and the emotion weights, respectively. If the emotion embedding layer is learning and distinguishing affect states correctly, we will see clear differences in the visualization.

With t-SNE [25], we are able to reduce the dimensionality of the weights to two and visualize them in a straightforward way. For better illustration, we selected the 100 most frequent (emotionally) positive words and the 100 most frequent negative words from the vocabulary, and used t-SNE to project the corresponding language model weights and emotion weights to two dimensions. Figure 2 gives the results in three subplots. Since HRAN does not have the emotion context vector, we just visualized the whole output layer weight vector, which does a similar job as the language model weights in

5 https://en.wikipedia.org/wiki/Fleiss%27_kappa#Interpretation
[Figure 2 appears here: three t-SNE scatter plots titled "HRAN – Language Model Weights", "MEED – Language Model Weights", and "MEED – Emotion Weights", with positive and negative words marked in different colors.]

Figure 2: t-SNE visualization of the output layer weights in HRAN and MEED. 100 most frequent positive words and 100 most frequent negative words are shown. The weight vectors in MEED are separated into two parts and visualized individually.
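The weight split visualized in Figure 2 can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the weight matrix is a random stand-in for the trained softmax matrix W, and a plain PCA projection (via NumPy's SVD) substitutes for t-SNE so the sketch has no extra dependencies.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the trained softmax weight matrix W of shape |V| x 2d:
# the first d columns pair with the language context vector s_t,
# the last d columns with the emotion context vector e.
V, d = 200, 256                      # 200 sampled words; d = 256 as in the paper
W = rng.normal(size=(V, 2 * d))

lang_weights = W[:, :d]              # "language model weights"
emo_weights = W[:, d:]               # "emotion weights"

def project_2d(X):
    """Project rows of X to 2-D with PCA via SVD (the paper uses t-SNE)."""
    Xc = X - X.mean(axis=0)          # center each feature
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:2].T             # coordinates along the top two components

lang_2d = project_2d(lang_weights)   # positive/negative words mix (panels 1-2)
emo_2d = project_2d(emo_weights)     # clusters by polarity for MEED (panel 3)
```

Each row of `lang_2d` or `emo_2d` is one word's 2-D coordinate; coloring the points by word polarity, with the real trained weights and t-SNE in place of PCA, reproduces the layout of the three panels.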


MEED. We can observe from the first two plots that positive words (green dots) and negative words (red dots) are scattered around and mixed with each other in the language model weights for HRAN and MEED respectively, which means no emotion information is captured in these weights. On the contrary, the emotion weights in MEED, in the last plot, have a clearer clustering effect, i.e., positive words are mainly grouped at the top-left, while negative words are mainly grouped at the bottom-right. This suggests that the emotion encoder in MEED is capable of tracking the emotion states in the conversation history.

Case Study. We present four sample dialogs in Table 6, along with the responses generated by the three models. Dialogs 1 and 2 are emotionally positive, and dialogs 3 and 4 are negative. For the first two examples, we can see that MEED is able to generate more emotional content (like "fun" and "congratulations") that is appropriate according to the context. For dialog 4, MEED responds in sympathy to the other speaker, which is consistent with the second utterance in the context. On the contrary, HRAN poses a question in reply, contradicting the dialog history.

5   DISCUSSION

In this section, we briefly discuss how our framework can incorporate other components, as well as several directions in which to extend it.

Emotion Recognition

To extract the affect information contained in the utterances, we used the LIWC text analysis program. We believe this emotion recognition step is vital for a dialog model to produce emotionally appropriate responses. However, the choice of emotion classifier is not strictly limited to LIWC. It could be replaced by another well-established affect recognizer, or by one that is more appropriate to the target domain. For example, we can consider using more fine-grained emotion categories from GALC [31], or using DeepMoji [8], which was trained on millions of tweets with emoji labels and is more suitable for tweet-like conversations. However, for DeepMoji, the 64 categories of emojis do not have a clear and exact correspondence with standardized emotion categories, nor with the VAD vectors.

Training Data

We pre-trained our model on the Cornell movie subtitles and then fine-tuned it on the DailyDialog dataset. We adopted this particular training order because we would like our chatbot to talk more like human chit-chat, and the DailyDialog dataset, compared with the bigger Cornell dataset, is more daily-based. Since our model learns how to respond properly in a data-driven way, we believe that having a training dataset of good quality that is also large enough plays an important role in developing an engaging and user-friendly chatbot. Thus, in the future, we plan to train our model on the multi-turn conversations that we have already extracted from the much bigger OpenSubtitles corpus and the EmpatheticDialogues dataset.6

Evaluation

Evaluation of dialog models remains an open problem in the response generation field. Early work [18, 30, 36] on response generation used automatic evaluation metrics borrowed from the machine translation field, such as the BLEU score, to evaluate dialog systems. Later on, Liu et al. [22] showed that these metrics correlate poorly with human judgement. Recently, a number of researchers began developing automatic and data-driven evaluation methods [24, 38], with the ultimate goal of replacing human evaluation. However, they are still at an early stage. In this paper, we used both perplexity measures and human judgement in our experiments to finalize our model. In other words, using the perplexity measures, we were able to determine when to stop training our model. But this condition does not guarantee optimal results until

6 https://github.com/facebookresearch/EmpatheticDialogues
IUI ’20 Workshops, March 17, 2020, Cagliari, Italy                                                                                  Xie, et al.
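To make the lexicon-based emotion recognition step discussed above concrete, the sketch below scores an utterance against a word-to-category lexicon. This is only a minimal illustration: the lexicon entries and the two category names (`posemo`, `negemo`) are hypothetical stand-ins, not the actual LIWC resources.

```python
from collections import Counter

# Hypothetical affect lexicon and categories (stand-ins for LIWC's
# emotion word lists, which are proprietary).
AFFECT_LEXICON = {
    "happy": "posemo", "fun": "posemo", "glad": "posemo",
    "awful": "negemo", "tired": "negemo", "painful": "negemo",
}
CATEGORIES = ["posemo", "negemo"]

def affect_vector(utterance):
    """Count lexicon hits per category, normalized by utterance length."""
    tokens = utterance.lower().split()
    counts = Counter(AFFECT_LEXICON[t] for t in tokens if t in AFFECT_LEXICON)
    n = max(len(tokens), 1)
    return [counts[c] / n for c in CATEGORIES]

print(affect_vector("This concert was awful"))  # → [0.0, 0.25]
```

Any recognizer exposing this utterance-to-vector interface (GALC categories, DeepMoji predictions) could be swapped in without changing the rest of the model.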


Table 6: Sample model responses. For each dialog, the ground truth is included in a pair of parentheses.

Dialog 1
  Context:
    A: I’m happy to see you again.
    B: Mee too.
    A: We should do this more often.
    (B: Okay, I’ll give you a ring next week.)
  Responses:
    S2S: I hope so.
    HRAN: Thanks a lot.
    MEED: That sounds like fun.

Dialog 2
  Context:
    A: Thank god! I am finished writing that service guide! It took me forever!
    B: When did you finish?
    A: This morning! No more overtime, and no more headaches!
    (B: Well, I’m glad to hear it. Have a cup of coffee!)
  Responses:
    S2S: When?
    HRAN: Why?
    MEED: Congratulations!

Dialog 3
  Context:
    A: I think that’s settled.
    B: I’m tired of your cut-and-dried opinions. Who do you think you are!
    A: How dare you speak to me like this.
    (B: Why not?)
  Responses:
    S2S: What is it?
    HRAN: What is it?
    MEED: Are you serious?

Dialog 4
  Context:
    A: This concert was awful.
    B: Agreed, the musicians were not in harmony.
    A: It was too painful. Never again.
    (B: That’s for sure!)
  Responses:
    S2S: Congratulations!
    HRAN: Why not?
    MEED: That’s true.
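As a concrete illustration of the perplexity-based stopping criterion used in our Evaluation discussion, the sketch below computes perplexity as the exponential of the mean per-token cross-entropy and selects the checkpoint where it stops improving. The per-token loss values and the epoch count here are hypothetical, not our actual training numbers.

```python
import math

def perplexity(token_losses):
    """Perplexity = exp(mean per-token negative log-likelihood)."""
    return math.exp(sum(token_losses) / len(token_losses))

# Hypothetical per-token validation losses for three epochs.
val_losses_per_epoch = [[4.1, 3.9], [3.6, 3.4], [3.7, 3.5]]
ppls = [perplexity(losses) for losses in val_losses_per_epoch]

# Stop training (or pick the checkpoint) at the lowest validation perplexity.
best_epoch = min(range(len(ppls)), key=lambda i: ppls[i])
print(best_epoch)  # → 1: validation perplexity worsens at epoch 2
```

As noted in the text, a low validation perplexity only tells us when to stop training; it does not by itself guarantee that the responses are good, which is why human judgement is still needed.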


human judgement tests can validate them. We thus highly recommend this combination, which is also a common practice in the research community [45, 48–50].

Model Extensions

Our model uses RNNs with GRU cells to encode the input sequences and capture long-term dependencies among different positions in the sequences. Recent advances in natural language understanding have proposed new network architectures for processing text input. Specifically, the Transformer [41] uses pure attention mechanisms without any recurrence structures. Compared with RNNs, the Transformer captures long-term dependencies better, thanks to its self-attention mechanism, which is free of locality biases; it is also more efficient to train because of its better parallelization capability. Following the Transformer architecture, researchers found that pre-training language models on huge amounts of data could largely boost the performance of downstream tasks, and published many pre-trained language models such as BERT [7] and RoBERTa [23]. As future work, we would like to adopt the Transformer architecture to replace the RNNs in our model, and to initialize our encoder with pre-trained language models. We hope this will improve the performance of response generation.

6 CONCLUSION

We believe reproducing conversational and emotional intelligence will make social chatbots more believable and engaging. In this paper, we proposed a multi-turn dialog system capable of recognizing and generating emotionally appropriate responses, which is the first step toward such a goal. We have demonstrated how to do so by (1) modeling utterances with extra affect vectors, (2) creating an emotional encoding mechanism that learns emotion exchanges in the dataset, (3) curating a multi-turn and balanced dialog dataset, and (4) evaluating the model with offline and online experiments. For future directions, we would like to investigate the diversity issue of the responses generated, possibly by extending the mutual information objective function [17] to multi-turn settings. We would also like to adopt the Transformer architecture with pre-trained language model weights, and train our model on a much larger dataset, by extracting multi-turn dialogs from the OpenSubtitles corpus.

REFERENCES
[1] Nabiha Asghar, Pascal Poupart, Jesse Hoey, Xin Jiang, and Lili Mou. 2018. Affective Neural Response Generation. In Proceedings of ECIR 2018. 154–166. https://doi.org/10.1007/978-3-319-76941-7_12
[2] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural Machine Translation by Jointly Learning to Align and Translate. CoRR abs/1409.0473 (2014). http://arxiv.org/abs/1409.0473
[3] Timothy W. Bickmore and Rosalind W. Picard. 2005. Establishing and Maintaining Long-Term Human-Computer Relationships. ACM Trans. Comput.-Hum. Interact. 12, 2 (2005), 293–327. https://doi.org/10.1145/1067860.1067867
[4] Hongshen Chen, Xiaorui Liu, Dawei Yin, and Jiliang Tang. 2017. A Survey on Dialogue Systems: Recent Advances and New Frontiers. SIGKDD Explorations 19, 2 (2017), 25–35. https://doi.org/10.1145/3166054.3166058
[5] Kyunghyun Cho, Bart van Merrienboer, Çaglar Gülçehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014.
Learning Phrase Representations Using RNN Encoder-Decoder for Statistical Machine Translation. In Proceedings of EMNLP 2014. 1724–1734. http://aclweb.org/anthology/D/D14/D14-1179.pdf
[6] Cristian Danescu-Niculescu-Mizil and Lillian Lee. 2011. Chameleons in Imagined Conversations: A New Approach to Understanding Coordination of Linguistic Style in Dialogs. In Proceedings of CMCL@ACL 2011. 76–87. https://aclanthology.info/papers/W11-0609/w11-0609
[7] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of NAACL-HLT 2019. 4171–4186. https://aclweb.org/anthology/papers/N/N19/N19-1423/
[8] Bjarke Felbo, Alan Mislove, Anders Søgaard, Iyad Rahwan, and Sune Lehmann. 2017. Using Millions of Emoji Occurrences to Learn Any-Domain Representations for Detecting Sentiment, Emotion and Sarcasm. In Proceedings of EMNLP 2017. 1615–1625. https://aclanthology.info/papers/D17-1169/d17-1169
[9] Robert H Finn. 1970. A Note on Estimating the Reliability of Categorical Data. Educational and Psychological Measurement 30, 1 (1970), 71–76.
[10] Joseph L Fleiss and Jacob Cohen. 1973. The Equivalence of Weighted Kappa and the Intraclass Correlation Coefficient as Measures of Reliability. Educational and Psychological Measurement 33, 3 (1973), 613–619.
[11] Sayan Ghosh, Mathieu Chollet, Eugene Laksana, Louis-Philippe Morency, and Stefan Scherer. 2017. Affect-LM: A Neural Language Model for Customizable Affective Text Generation. In Proceedings of ACL 2017. 634–642. https://doi.org/10.18653/v1/P17-1059
[12] David C Howell. 2016. Fundamental Statistics for the Behavioral Sciences. Nelson Education.
[13] George Hripcsak and Daniel F. Heitjan. 2002. Measuring Agreement in Medical Informatics Reliability Studies. Journal of Biomedical Informatics 35, 2 (2002), 99–110. https://doi.org/10.1016/S1532-0464(02)00500-2
[14] Tianran Hu, Anbang Xu, Zhe Liu, Quanzeng You, Yufan Guo, Vibha Sinha, Jiebo Luo, and Rama Akkiraju. 2018. Touch Your Heart: A Tone-aware Chatbot for Customer Care on Social Media. In Proceedings of CHI 2018. 415. https://doi.org/10.1145/3173574.3173989
[15] Diederik P. Kingma and Jimmy Ba. 2014. Adam: A Method for Stochastic Optimization. CoRR abs/1412.6980 (2014). http://arxiv.org/abs/1412.6980
[16] Jonathan Klein, Youngme Moon, and Rosalind W. Picard. 2001. This Computer Responds to User Frustration: Theory, Design, and Results. Interacting with Computers 14, 2 (2001), 119–140. https://doi.org/10.1016/S0953-5438(01)00053-4
[17] Jiwei Li, Michel Galley, Chris Brockett, Jianfeng Gao, and Bill Dolan. 2016. A Diversity-Promoting Objective Function for Neural Conversation Models. In Proceedings of NAACL-HLT 2016. 110–119. http://aclweb.org/anthology/N/N16/N16-1014.pdf
[18] Jiwei Li, Michel Galley, Chris Brockett, Georgios P. Spithourakis, Jianfeng Gao, and William B. Dolan. 2016. A Persona-Based Neural Conversation Model. In Proceedings of ACL 2016. http://aclweb.org/anthology/P/P16/P16-1094.pdf
[19] Jiwei Li, Will Monroe, Alan Ritter, Dan Jurafsky, Michel Galley, and Jianfeng Gao. 2016. Deep Reinforcement Learning for Dialogue Generation. In Proceedings of EMNLP 2016. 1192–1202. http://aclweb.org/anthology/D/D16/D16-1127.pdf
[20] Yanran Li, Hui Su, Xiaoyu Shen, Wenjie Li, Ziqiang Cao, and Shuzi Niu. 2017. DailyDialog: A Manually Labelled Multi-turn Dialogue Dataset. In Proceedings of IJCNLP 2017.
[21] Pierre Lison and Jörg Tiedemann. 2016. OpenSubtitles2016: Extracting Large Parallel Corpora from Movie and TV Subtitles. In Proceedings of LREC 2016. http://www.lrec-conf.org/proceedings/lrec2016/summaries/947.html
[22] Chia-Wei Liu, Ryan Lowe, Iulian Serban, Michael Noseworthy, Laurent Charlin, and Joelle Pineau. 2016. How NOT To Evaluate Your Dialogue System: An Empirical Study of Unsupervised Evaluation Metrics for Dialogue Response Generation. In Proceedings of EMNLP 2016. 2122–2132. http://aclweb.org/anthology/D/D16/D16-1230.pdf
[23] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A Robustly Optimized BERT Pretraining Approach. CoRR abs/1907.11692 (2019). http://arxiv.org/abs/1907.11692
[24] Ryan Lowe, Michael Noseworthy, Iulian Vlad Serban, Nicolas Angelard-Gontier, Yoshua Bengio, and Joelle Pineau. 2017. Towards an Automatic Turing Test: Learning to Evaluate Dialogue Responses. In Proceedings of ACL 2017. 1116–1126. https://doi.org/10.18653/v1/P17-1103
[25] Laurens van der Maaten and Geoffrey Hinton. 2008. Visualizing Data Using t-SNE. Journal of Machine Learning Research 9, Nov (2008), 2579–2605.
[26] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient Estimation of Word Representations in Vector Space. CoRR abs/1301.3781 (2013). http://arxiv.org/abs/1301.3781
[27] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A Method for Automatic Evaluation of Machine Translation. In Proceedings of ACL 2002. 311–318. http://www.aclweb.org/anthology/P02-1040.pdf
[28] James W Pennebaker, Martha E Francis, and Roger J Booth. 2001. Linguistic Inquiry and Word Count: LIWC 2001. Mahway: Lawrence Erlbaum Associates 71, 2001 (2001), 2001.
[29] Hannah Rashkin, Eric Michael Smith, Margaret Li, and Y-Lan Boureau. 2019. Towards Empathetic Open-domain Conversation Models: A New Benchmark and Dataset. In Proceedings of ACL 2019. 5370–5381. https://www.aclweb.org/anthology/P19-1534/
[30] Alan Ritter, Colin Cherry, and William B. Dolan. 2011. Data-Driven Response Generation in Social Media. In Proceedings of EMNLP 2011. 583–593. http://www.aclweb.org/anthology/D11-1054
[31] Klaus R Scherer. 2005. What Are Emotions? And How Can They Be Measured? Social Science Information 44, 4 (2005), 695–729.
[32] Iulian Vlad Serban, Alessandro Sordoni, Yoshua Bengio, Aaron C. Courville, and Joelle Pineau. 2016. Building End-To-End Dialogue Systems Using Generative Hierarchical Neural Network Models. In Proceedings of AAAI 2016. 3776–3784. http://www.aaai.org/ocs/index.php/AAAI/AAAI16/paper/view/11957
[33] Iulian Vlad Serban, Alessandro Sordoni, Ryan Lowe, Laurent Charlin, Joelle Pineau, Aaron C. Courville, and Yoshua Bengio. 2017. A Hierarchical Latent Variable Encoder-Decoder Model for Generating Dialogues. In Proceedings of AAAI 2017. 3295–3301. http://aaai.org/ocs/index.php/AAAI/AAAI17/paper/view/14567
[34] Lifeng Shang, Zhengdong Lu, and Hang Li. 2015. Neural Responding Machine for Short-Text Conversation. In Proceedings of ACL-IJCNLP 2015. 1577–1586. http://aclweb.org/anthology/P/P15/P15-1152.pdf
[35] Alessandro Sordoni, Yoshua Bengio, Hossein Vahabi, Christina Lioma, Jakob Grue Simonsen, and Jian-Yun Nie. 2015. A Hierarchical Recurrent Encoder-Decoder for Generative Context-Aware Query Suggestion. In Proceedings of CIKM 2015. 553–562. https://doi.org/10.1145/2806416.2806493
[36] Alessandro Sordoni, Michel Galley, Michael Auli, Chris Brockett, Yangfeng Ji, Margaret Mitchell, Jian-Yun Nie, Jianfeng Gao, and Bill Dolan. 2015. A Neural Network Approach to Context-Sensitive Generation of Conversational Responses. In Proceedings of NAACL-HLT 2015. 196–205. http://aclweb.org/anthology/N/N15/N15-1020.pdf
[37] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to Sequence Learning with Neural Networks. In Proceedings of NIPS 2014. 3104–3112. http://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks
[38] Chongyang Tao, Lili Mou, Dongyan Zhao, and Rui Yan. 2018. RUBER: An Unsupervised Method for Automatic Evaluation of Open-Domain Dialog Systems. In Proceedings of AAAI 2018. 722–729. https://www.aaai.org/ocs/index.php/AAAI/AAAI18/paper/view/16179
[39] Christoph Tillmann and Hermann Ney. 2003. Word Reordering and a Dynamic Programming Beam Search Algorithm for Statistical Machine Translation. Computational Linguistics 29, 1 (2003), 97–133. https://doi.org/10.1162/089120103321337458
[40] Howard E Tinsley and David J Weiss. 1975. Interrater Reliability and Agreement of Subjective Judgments. Journal of Counseling Psychology 22, 4 (1975), 358.
[41] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is All you Need. In Proceedings of NIPS 2017. 5998–6008. http://papers.nips.cc/paper/7181-attention-is-all-you-need
[42] Oriol Vinyals and Quoc V. Le. 2015. A Neural Conversational Model. CoRR abs/1506.05869 (2015). http://arxiv.org/abs/1506.05869
[43] Amy Beth Warriner, Victor Kuperman, and Marc Brysbaert. 2013. Norms of Valence, Arousal, and Dominance for 13,915 English Lemmas. Behavior Research Methods 45, 4 (2013), 1191–1207.
[44] Chen Xing, Wei Wu, Yu Wu, Jie Liu, Yalou Huang, Ming Zhou, and Wei-Ying Ma. 2017. Topic Aware Neural Response Generation. In Proceedings of AAAI 2017. 3351–3357. http://aaai.org/ocs/index.php/AAAI/AAAI17/paper/view/14563
[45] Chen Xing, Yu Wu, Wei Wu, Yalou Huang, and Ming Zhou. 2018. Hierarchical Recurrent Attention Network for Response Generation. In Proceedings of AAAI 2018. 5610–5617. https://www.aaai.org/ocs/index.php/AAAI/AAAI18/paper/view/16510
[46] Anbang Xu, Zhe Liu, Yufan Guo, Vibha Sinha, and Rama Akkiraju. 2017. A New Chatbot for Customer Service on Social Media. In Proceedings of CHI 2017. 3506–3510. https://doi.org/10.1145/3025453.3025496
[47] Jennifer Zamora. 2017. I’m Sorry, Dave, I’m Afraid I Can’t Do That: Chatbot Perception and Expectations. In Proceedings of HAI 2017. 253–260. https://doi.org/10.1145/3125739.3125766
[48] Peixiang Zhong, Di Wang, and Chunyan Miao. 2019. An Affect-Rich Neural Conversational Model with Biased Attention and Weighted Cross-Entropy Loss. In Proceedings of AAAI 2019. 7492–7500. https://aaai.org/ojs/index.php/AAAI/article/view/4740
[49] Hao Zhou, Minlie Huang, Tianyang Zhang, Xiaoyan Zhu, and Bing Liu. 2018. Emotional Chatting Machine: Emotional Conversation Generation with Internal and External Memory. In Proceedings of AAAI 2018. https://www.aaai.org/ocs/index.php/AAAI/AAAI18/paper/view/16455
[50] Xianda Zhou and William Yang Wang. 2018. MojiTalk: Generating Emotional Responses at Scale. In Proceedings of ACL 2018. 1128–1137. https://doi.org/10.18653/v1/P18-1104