Automatically Predicting User Ratings for Conversational Systems

A. Cervone (1), E. Gambi (1), G. Tortoreto (2), E.A. Stepanov (2), G. Riccardi (1)
(1) Signals and Interactive Systems Lab, University of Trento, Trento, Italy
(2) VUI, Inc., Trento, Italy
{alessandra.cervone, enrico.gambi, giuseppe.riccardi}@unitn.it, {eas,gtr}@vui.com

Abstract

Automatic evaluation models for open-domain conversational agents either correlate poorly with human judgment or require expensive annotations on top of conversation scores. In this work we investigate the feasibility of learning evaluation models without relying on any further annotations besides conversation-level human ratings. We use a dataset of rated (1-5) open-domain spoken conversations between the conversational agent Roving Mind (competing in the Amazon Alexa Prize Challenge 2017) and Amazon Alexa users. First, we assess the complexity of the task by asking two experts to re-annotate a sample of the dataset, and observe that the subjectivity of user ratings yields a low upper bound. Second, through an analysis of the entire dataset we show that automatically extracted features such as user sentiment, Dialogue Acts and conversation length have significant, but low, correlation with user ratings. Finally, we report the results of our experiments exploring different combinations of these features to train automatic dialogue evaluation models. Our work suggests that predicting subjective user ratings in open-domain conversations is a challenging task.

1 Introduction

We are currently witnessing a proliferation of conversational agents in both industry and academia. Nevertheless, core questions regarding this technology remain to be addressed or analysed in greater depth. This work focuses on one such question: can we automatically predict user ratings of a dialogue with a conversational agent?

Metrics for task-based systems are generally related to the successful completion of the task.
Among these, contextual appropriateness (Danieli and Gerbino, 1995) evaluates, for example, the degree of contextual coherence of machine turns with respect to user queries, which are classified with ternary values for slots (appropriate, inappropriate, and ambiguous). The approach is somewhat similar to the attribute-value matrix of the popular PARADISE dialogue evaluation framework (Walker et al., 1997), where matrices representing the information exchange requirements between the machine and users towards solving the dialogue task serve as a measure of task success rate.

Unlike task-based systems, non-task-based conversational agents (also known as chitchat models) do not have a specific task to accomplish (e.g. booking a restaurant). Their goal can arguably be defined as the conversation itself, i.e. the entertainment of the human they are conversing with. Thus, human judgment is still the most reliable evaluation tool we have for such conversational agents. Collecting user ratings for a system, however, is expensive and time-consuming.

In order to deal with these issues, researchers have been investigating automatic metrics for non-task-based dialogue evaluation. The most popular of these metrics (e.g. BLEU (Papineni et al., 2002), METEOR (Banerjee and Lavie, 2005)) rely on surface text similarity (word overlaps) between machine and reference responses to the same utterances. Notwithstanding their popularity, such metrics are hardly compatible with the nature of human dialogue, since there could be multiple appropriate responses to the same utterance with no word overlap. Moreover, these metrics correlate weakly with human judgments (Liu et al., 2016).

Recently, a few studies proposed metrics having a better correlation with human judgment. ADEM (Lowe et al., 2017) is a model trained on appropriateness scores manually annotated at the response level. Venkatesh et al. (2017) and Guo et al. (2017) combine multiple metrics, each capturing a different aspect of the interaction, and predict conversation-level ratings. In particular, Venkatesh et al. (2017) show the importance of metrics such as coherence, conversational depth and topic diversity, while Guo et al. (2017) propose topic-based metrics. However, these studies require extensive manual annotation on top of conversation-level ratings.

In this work, we investigate non-task-based dialogue evaluation models trained without relying on any further annotations besides conversation-level user ratings. Our goal is twofold: investigating conversation features which characterize good interactions with a conversational agent, and exploring the feasibility of training a model able to predict user ratings in such a context.

In order to do so, we utilize a dataset of non-task-based spoken conversations between Amazon Alexa users and Roving Mind (Cervone et al., 2017), our open-domain system for the Amazon Alexa Prize Challenge 2017 (Ram et al., 2017). As an upper bound for the rating prediction task, we re-annotate a sample of the corpus using experts and analyse the correlation between expert and user ratings. Afterwards, we analyse the entire corpus using well-known automatically extractable features (user sentiment, Dialogue Acts of both user and machine, conversation length and average user turn length), which show a low, but still significant, correlation with user ratings. We then show how different combinations of these features, together with an LSA representation of the user turns, can be used to train a regression model whose predictions also yield a low, but significant, correlation with user ratings. Our results indicate the difficulty of predicting how users might rate interactions with a conversational agent.

2 Data Collection

The dataset analysed in this paper was collected over a period of 27 days during the Alexa Prize 2017 semifinals and consists of conversations between our system Roving Mind and Amazon Alexa users in the United States. The users could end the conversation whenever they wanted, using a command. At the end of the interaction, users were asked to rate the conversation on a 1 (not satisfied at all) to 5 (very satisfied) Likert scale.
Out of all the rated conversations, we selected the ones longer than 3 turns, yielding 4,967 conversations. Figure 1 shows the distribution (in percentages) of the ratings in our dataset. The large majority of conversations are between the system and "first-time" users, as only 5.25% of users had more than one conversation.

[Figure 1: bar chart of Conversation Volume (%) by Rating (1-5) for Users, Experts, and All ratings.]
Figure 1: Distribution of user and expert ratings on the annotated random sample of 100 conversations (test set) compared to the distribution of ratings in the entire dataset ("All ratings"). For clarity of presentation, from the latter we excluded the small portion of non-integer ratings (2.3% of the dataset).

3 Methodology

In this section we describe the conversation representation features, experimentation, and evaluation methodologies used in the paper.

3.1 Conversation Representation Features

Since in the competition the objective of the system was to entertain users, we expect the ratings to reflect how much users enjoyed the interaction. User "enjoyment" can be approximated using different metrics that do not require manual annotation, such as conversation length (in turns) and mean turn length (in words), assuming that the more users enjoy the conversation the longer they talk, and sentiment polarity, hypothesizing that enjoyable conversations should carry a more positive sentiment. While length metrics are straightforward to compute, the sentiment score is computed using a lexicon-based approach (Kennedy and Inkpen, 2006).

Another representation that could shed light on enjoyable conversations is the Dialogue Acts (DAs) of user and machine utterances. DAs are frequently used as a generic representation of intents, and the considered labels often include thanking, apologies, opinions, statements and the like. Relative frequencies of these tags can potentially be useful to distinguish good and bad conversations. The DA tagger we use is the one described in Mezza et al. (2018), trained with Support Vector Machines on the Switchboard Dialogue Acts corpus (Stolcke et al., 2000), a subset of Switchboard (Godfrey et al., 1992) annotated with DAs (42 categories). The user and machine DAs are considered as separate vectors and assessed both individually and jointly.

In addition to Dialogue Acts, sentiment and length features, we experiment with a word-based text representation. Latent Semantic Analysis (LSA) is used to convert a conversation to a vector. First, we construct a word-document co-occurrence matrix and normalize it. Then, we reduce the dimensionality to 100 by applying Singular Value Decomposition (SVD).
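As a concrete illustration of this step, the following is a minimal sketch of such an LSA pipeline built from standard scikit-learn components. It is not necessarily the exact implementation used here: it assumes each conversation is available as a single text string, uses TF-IDF weighting as the normalization step, and exposes the n-gram range as a parameter (Section 6 compares ranges from 1 to 4).

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import make_pipeline

def build_lsa_vectors(conversations, n_components=100, ngram_range=(1, 4)):
    """Map each conversation (a text string) to a dense LSA vector."""
    # Word-document co-occurrence matrix with TF-IDF normalization
    vectorizer = TfidfVectorizer(ngram_range=ngram_range)
    # Dimensionality reduction via truncated SVD (the LSA step)
    svd = TruncatedSVD(n_components=n_components, random_state=0)
    lsa = make_pipeline(vectorizer, svd)
    return lsa.fit_transform(conversations)  # shape: (n_conversations, n_components)
```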
3.2 Correlation Analysis Methodology

The two correlation metrics we use are the widely adopted Pearson correlation coefficient (PCC) and Spearman's rank correlation coefficient (SRCC). While the former evaluates the linear relationship between two variables, the latter evaluates the monotonic one. These metrics are used to assess the correlations of different conversation features, such as sentiment score or conversation length, with the human ratings provided for those conversations, as well as to assess the correlation of the predicted scores of the regression models with those ratings. For the assessment of the correlation of both features and regression models, raw rating predictions are used.

3.3 Prediction Methodology

Using the conversation features described above, we train regression models to predict human ratings. We experiment with both Linear Regression and Support Vector Regression (SVR) with a radial basis function (RBF) kernel, using scikit-learn (Pedregosa et al., 2011). Since the latter consistently outperforms the former, we report only the results for the SVR. The performance of the regression models is evaluated using the standard metrics of Root Mean Squared Error (RMSE) and Mean Absolute Error (MAE). Additionally, we compute Pearson and Spearman's rank correlation coefficients for the predictions with respect to the reference human ratings.

We experiment in a 10-fold cross-validation setting. The performance of the regression models is compared to two baselines: (1) a mean baseline, where all instances in the testing fold are assigned as a score the mean of the training set ratings, and (2) a chance baseline, where an instance is randomly assigned a rating from 1 to 5 with respect to the rating distribution in the training set. The models are compared for statistical significance to these baselines using a paired two-tailed t-test with p < 0.05. In Section 6 we report average RMSE and MAE as well as average correlation coefficients.
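The sketch below illustrates how this setup can be assembled with scikit-learn and SciPy: an RBF-kernel SVR evaluated with 10-fold cross-validation, a mean baseline, RMSE, correlation of the pooled predictions with the reference ratings, and a paired two-tailed t-test on the per-fold errors. It is a simplified outline rather than our exact code: it assumes a feature matrix X and a rating vector y are already available as NumPy arrays, pairs the t-test over folds (one possible choice), and omits MAE and the chance baseline, which follow the same pattern.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr, ttest_rel
from sklearn.model_selection import KFold
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error

def evaluate_svr(X, y, n_splits=10, seed=0):
    """Cross-validate an RBF-kernel SVR and compare it to a mean baseline."""
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    svr_rmse, base_rmse, preds, refs = [], [], [], []
    for train_idx, test_idx in kf.split(X):
        model = SVR(kernel="rbf").fit(X[train_idx], y[train_idx])
        y_pred = model.predict(X[test_idx])
        # Mean baseline: every test conversation gets the training-fold mean rating
        y_base = np.full(len(test_idx), y[train_idx].mean())
        svr_rmse.append(np.sqrt(mean_squared_error(y[test_idx], y_pred)))
        base_rmse.append(np.sqrt(mean_squared_error(y[test_idx], y_base)))
        preds.extend(y_pred)
        refs.extend(y[test_idx])
    pcc, _ = pearsonr(refs, preds)
    srcc, _ = spearmanr(refs, preds)
    # Paired two-tailed t-test: per-fold RMSE of the model vs. the baseline
    _, p_value = ttest_rel(svr_rmse, base_rmse)
    return np.mean(svr_rmse), pcc, srcc, p_value
```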
4 Upper bound

Since human ratings are inherently subjective, and different users can rate the same conversation differently, it is difficult to expect the models to yield perfect correlations or very low RMSE and MAE. In order to test this hypothesis, two human experts (members of our Alexa Prize team) were asked to rate a random subset of the corpus (100 conversations). The rating distributions for both experts and users on this sample are reported in Figure 1. We observe that expert ratings tend to be closer to the middle of the Likert scale (i.e. from 2 to 4), while users had more conversations with ratings at both extremes of the scale (i.e. 1 and 5).

The RMSE, MAE and Pearson and Spearman's rank correlation coefficients of expert and user ratings are reported in Table 1. We observe that the experts tend to agree with each other more than they agree individually with users: compared to each other, the experts have the highest Pearson and Spearman correlation scores (0.705 and 0.694, respectively) and the lowest RMSE and MAE (0.875 and 0.660, respectively). The fact that expert ratings do not correlate with user ratings as well as they correlate among themselves confirms the difficulty of the task of predicting subjective user ratings, even for humans.

                 RMSE   MAE    PCC    SRCC
Exp 1 vs. Exp 2  0.875  0.660  0.705  0.694
Exp 1 vs. Users  1.225  0.966  0.538  0.526
Exp 2 vs. Users  1.286  1.016  0.401  0.370

Table 1: Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), Pearson (PCC) and Spearman's rank (SRCC) correlation coefficients among user and expert ratings.

5 Correlation Analysis Results

The results of the correlation analysis are reported in Table 2. From the table, we can observe that conversation length has a positive correlation with human judgment, while average user turn length has a negative correlation. The positive correlation with conversation length confirms the expectation that users tend to have longer conversations with the system when they enjoy it. The negative correlation with average user turn length, on the other hand, is unexpected. As expected, sentiment score has a significant positive correlation with human judgments.

Feature                  PCC       SRCC
Conversation Length      0.133**   0.111**
Av. User Turn Length    -0.068**  -0.079**
User Sentiment           0.071**   0.088**
User Dialogue Acts
  yes-answer             0.081**   0.088**
  appreciation           0.070**   0.115**
  thanking               0.062**   0.089**
  action-directive      -0.069**  -0.052**
  statement-non-opinion  0.050**   0.037**
  ...
Machine Dialogue Acts
  yes-no-question        0.042**   0.038**
  statement-opinion     -0.027**  -0.032**
  ...

Table 2: Pearson (PCC) and Spearman's rank (SRCC) correlation coefficients for conversation lengths, sentiment score, and user and machine Dialogue Acts. Correlations significant with p < 0.05 are marked with * and p < 0.01 with **.

Due to space considerations, we report only a portion of the DAs that have significant correlations with human ratings. The analysis confirms our expectation that user DAs such as thanking and appreciation have significant positive correlations. We also observe that the action-directive DA has a negative correlation. Since this DA label covers the turns where a user issues control commands to the system, we hypothesize that this correlation could be due to users taking a task-based approach with our system, which was instead designed for chitchat, and therefore feeling disappointed (e.g. requesting the Roving Mind system to perform actions it was not designed to perform, such as playing music).

Regarding machine DAs, we observe that even though some DAs exhibit significant correlations, overall these are lower than for user DAs. In particular, yes-no-question has a significant positive correlation with human judgments, indicating that some users appreciate machine initiative in the conversation. The analysis confirms the utility of length and sentiment features, as well as the importance of some DAs (generic intents), for estimating user ratings.
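The feature-level correlations reported in Table 2 can be reproduced in outline with SciPy. The sketch below is illustrative only: the feature names and values in the commented call are hypothetical placeholders, and it assumes each feature has already been reduced to one number per conversation (e.g. the relative frequency of a Dialogue Act, or the conversation length in turns).

```python
from scipy.stats import pearsonr, spearmanr

def feature_rating_correlations(features, ratings):
    """Pearson/Spearman correlations (with p-values) of per-conversation features vs. user ratings.

    features: dict mapping a feature name to a list with one value per conversation.
    ratings:  list of conversation-level user ratings (1-5), in the same order.
    """
    results = {}
    for name, values in features.items():
        pcc, p_pcc = pearsonr(values, ratings)
        srcc, p_srcc = spearmanr(values, ratings)
        results[name] = (pcc, p_pcc, srcc, p_srcc)
    return results

# Illustrative call with made-up values for three conversations:
# feature_rating_correlations(
#     {"conversation_length": [12, 5, 30], "user_sentiment": [0.2, -0.1, 0.5]},
#     [4, 2, 5])
```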
6 Prediction Results

The results of the experiments using 10-fold cross-validation and Support Vector Regression are reported in Table 3. We report the performance of each feature representation in isolation, as well as of their combinations. We consider two baselines: chance and mean. For the chance baseline, an instance is randomly assigned a rating with respect to the training set distribution. For the mean baseline, on the other hand, all instances are assigned the mean of the training set as a rating. The mean baseline yields better RMSE and MAE scores; consequently, we compare the regression models to it.

                  RMSE    MAE     PCC      SRCC
BL: Chance        1.967*  1.535*  0.007**  0.023**
BL: Mean          1.382*  1.189*  N/A      N/A
Lengths           1.400*  1.116*  0.153**  0.158**
Sentiment         1.423*  1.128*  0.109**  0.122**
DA: user          1.378*  1.106*  0.213**  0.207**
DA: machine       1.418*  1.129*  0.104**  0.099**
DA: user+machine  1.375*  1.106*  0.219**  0.211**
LSA               1.350*  1.075*  0.299**  0.288**
All - LSA         1.366*  1.100*  0.240**  0.230**
All               1.350*  1.078*  0.303**  0.290**

Table 3: 10-fold cross-validation average Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), Pearson (PCC) and Spearman's rank (SRCC) correlation coefficients for the regression models. RMSE and MAE significantly better than the baselines are marked with *. Correlations significant with p < 0.05 are marked with * and p < 0.01 with **.

Sentiment and length features (conversation length and average user turn length) both yield RMSE higher than the mean baseline and MAE significantly lower than it. Nonetheless, their predictions have significant positive correlations with the reference human ratings. The picture is similar for the models trained on user and machine DAs alone and on their combination: the RMSE scores are higher than, or insignificantly lower than, the mean baseline, while the MAE scores are significantly lower.

For the LSA representation of conversations we consider n-gram sizes between 1 and 4. The representation that considers 4-grams with an SVD dimension of 100 yields the best performance; thus, we report the performance of this model only, and use it for the feature combination experiments. The LSA model yields significantly lower error both in terms of RMSE and MAE. Additionally, the correlation of its predictions is higher than for the other features (and combinations).

The regression model trained on all features except LSA yields performance significantly better than the mean baseline, but inferior to that of LSA alone. The combination of all the features retains the best RMSE of the LSA model, but achieves a slightly worse MAE score. While it yields the best Pearson and Spearman's rank correlation coefficients among all the models, the difference from the LSA-only model is not statistically significant according to the Fisher r-to-z transformation.
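For reference, one common way to carry out such a comparison is the Fisher r-to-z test for the difference between two Pearson correlation coefficients; a minimal sketch of the standard independent-samples form is given below. This is an illustration rather than our exact procedure (variants for dependent, same-sample correlations also exist), and the sample sizes in the commented call are approximations based on the 4,967 conversations in our dataset.

```python
import numpy as np
from scipy.stats import norm

def fisher_r_to_z_test(r1, n1, r2, n2):
    """Two-tailed p-value for the difference between two Pearson correlations."""
    z1, z2 = np.arctanh(r1), np.arctanh(r2)        # Fisher r-to-z transformation
    se = np.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))  # standard error of the difference
    z = (z1 - z2) / se
    return 2.0 * (1.0 - norm.cdf(abs(z)))

# e.g. comparing the LSA-only and all-features correlations from Table 3:
# p = fisher_r_to_z_test(0.299, 4967, 0.303, 4967)
```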
Pretten- ation measures for machine translation and/or sum- hofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Pas- marization, volume 29, pages 65–72. sos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine learn- Alessandra Cervone, Giuliano Tortoreto, Stefano ing in Python. Journal of Machine Learning Re- Mezza, Enrico Gambi, and Giuseppe Riccardi. search. 2017. Roving mind: a balancing act between open– domain and engaging dialogue systems. In Alexa Ashwin Ram, Rohit Prasad, Chandra Khatri, Anu Prize Proceedings. Venkatesh, Raefer Gabriel, Qing Liu, Jeff Nunn, Behnam Hedayatnia, Ming Cheng, Ashish Nagar, Morena Danieli and Elisabetta Gerbino. 1995. Metrics Eric King, Kate Bland, Amanda Wartick, Yi Pan, for evaluating dialogue strategies in a spoken lan- Han Song, Sk Jayadevan, Gene Hwang, and Art Pet- guage system. In Proceedings of the 1995 AAAI tigrue. 2017. Conversational ai: The science behind spring symposium on Empirical Methods in Dis- the alexa prize. In Alexa Prize Proceedings. course Interpretation and Generation, volume 16, pages 34–39. Andreas Stolcke, Klaus Ries, Noah Coccaro, Eliza- beth Shriberg, Rebecca Bates, Daniel Jurafsky, Paul John J Godfrey, Edward C Holliman, and Jane Mc- Taylor, Rachel Martin, Carol Van Ess-Dykema, and Daniel. 1992. Switchboard: Telephone speech cor- Marie Meteer. 2000. Dialogue act modeling for pus for research and development. In Acoustics, automatic tagging and recognition of conversational Speech, and Signal Processing, 1992. ICASSP-92., speech. Computational linguistics, 26(3):339–373. 1992 IEEE International Conference on, volume 1, pages 517–520. IEEE. Anu Venkatesh, Chandra Khatri, Ashwin Ram, Fenfei Guo, Raefer Gabriel, Ashish Nagar, Rohit Prasad, Fenfei Guo, Angeliki Metallinou, Chandra Khatri, Ming Cheng, Benham Hedayatnia, Angeliki Met- Anirudh Raju, Anu Venkatesh, and Ashwin Ram. allinou, Rahul Goel, Shaohua Yang, and Anirudh 2017. Topic-based evaluation for conversational Raju. 2017. On evaluating and comparing con- bots. In NIPS 2017 Conversational AI workshop. versational agents. In NIPS 2017 Conversational AI Alistair Kennedy and Diana Inkpen. 2006. Senti- workshop. ment classification of movie reviews using contex- Marilyn A Walker, Diane J Litman, Candace A Kamm, tual valence shifters. Computational intelligence, and Alicia Abella. 1997. Paradise: A framework 22(2):110–125. for evaluating spoken dialogue agents. In Proceed- Chia-Wei Liu, Ryan Lowe, Iulian Serban, Mike Nose- ings of the eighth conference on European chap- worthy, Laurent Charlin, and Joelle Pineau. 2016. ter of the Association for Computational Linguistics, How not to evaluate your dialogue system: An em- pages 271–280. Association for Computational Lin- pirical study of unsupervised evaluation metrics for guistics. dialogue response generation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2122–2132. Ryan Lowe, Michael Noseworthy, Iulian Vlad Ser- ban, Nicolas Angelard-Gontier, Yoshua Bengio, and Joelle Pineau. 2017. Towards an automatic turing test: Learning to evaluate dialogue responses. In Proceedings of the 55th Annual Meeting of the As- sociation for Computational Linguistics, volume 1, pages 1116–1126. Stefano Mezza, Alessandra Cervone, Giuliano Tor- toreto, Evgeny A. Stepanov, and Giuseppe Riccardi. 2018. Iso-standard domain-independent dialogue act tagging for conversational agents. 
Anu Venkatesh, Chandra Khatri, Ashwin Ram, Fenfei Guo, Raefer Gabriel, Ashish Nagar, Rohit Prasad, Ming Cheng, Behnam Hedayatnia, Angeliki Metallinou, Rahul Goel, Shaohua Yang, and Anirudh Raju. 2017. On evaluating and comparing conversational agents. In NIPS 2017 Conversational AI Workshop.

Marilyn A. Walker, Diane J. Litman, Candace A. Kamm, and Alicia Abella. 1997. PARADISE: A framework for evaluating spoken dialogue agents. In Proceedings of the Eighth Conference of the European Chapter of the Association for Computational Linguistics, pages 271–280. Association for Computational Linguistics.