Automatically Predicting User Ratings for Conversational Systems

A. Cervone (1), E. Gambi (1), G. Tortoreto (2), E.A. Stepanov (2), G. Riccardi (1)
(1) Signals and Interactive Systems Lab, University of Trento, Trento, Italy
(2) VUI, Inc., Trento, Italy
{alessandra.cervone, enrico.gambi, giuseppe.riccardi}@unitn.it, {eas,gtr}@vui.com

Abstract

Automatic evaluation models for open-domain conversational agents either correlate poorly with human judgment or require expensive annotations on top of conversation scores. In this work we investigate the feasibility of learning evaluation models without relying on any further annotations besides conversation-level human ratings. We use a dataset of rated (1-5) open-domain spoken conversations between the conversational agent Roving Mind (competing in the Amazon Alexa Prize Challenge 2017) and Amazon Alexa users. First, we assess the complexity of the task by asking two experts to re-annotate a sample of the dataset, and observe that the subjectivity of user ratings yields a low upper bound. Second, through an analysis of the entire dataset we show that automatically extracted features such as user sentiment, Dialogue Acts and conversation length have significant, but low, correlation with user ratings. Finally, we report the results of our experiments exploring different combinations of these features to train automatic dialogue evaluation models. Our work suggests that predicting subjective user ratings in open-domain conversations is a challenging task.

1 Introduction

We are currently witnessing a proliferation of conversational agents in both industry and academia. Nevertheless, core questions regarding this technology remain to be addressed or analysed in greater depth. This work focuses on one such question: can we automatically predict user ratings of a dialogue with a conversational agent?

Metrics for task-based systems are generally related to the successful completion of the task.
Among these, contextual appropriateness (Danieli and Gerbino, 1995) evaluates, for example, the degree of contextual coherence of machine turns with respect to user queries, which are classified with ternary values for slots (appropriate, inappropriate, and ambiguous). The approach is somewhat similar to the attribute-value matrix of the popular PARADISE dialogue evaluation framework (Walker et al., 1997), where matrices representing the information exchange requirements between the machine and users towards solving the dialogue task serve as a measure of task success rate.

Unlike task-based systems, non-task-based conversational agents (also known as chitchat models) do not have a specific task to accomplish (e.g. booking a restaurant). Their goal can arguably be defined as the conversation itself, i.e. the entertainment of the human they are conversing with. Thus, human judgment is still the most reliable evaluation tool we have for such conversational agents. Collecting user ratings for a system, however, is expensive and time-consuming.

In order to deal with these issues, researchers have been investigating automatic metrics for non-task-based dialogue evaluation. The most popular of these metrics (e.g. BLEU (Papineni et al., 2002), METEOR (Banerjee and Lavie, 2005)) rely on surface text similarity (word overlaps) between machine and reference responses to the same utterances. Notwithstanding their popularity, such metrics are hardly compatible with the nature of human dialogue, since there could be multiple appropriate responses to the same utterance with no word overlap. Moreover, these metrics correlate weakly with human judgments (Liu et al., 2016).

Recently, a few studies proposed metrics having a better correlation with human judgment. ADEM (Lowe et al., 2017) is a model trained on appropriateness scores manually annotated at the response level. Venkatesh et al. (2017) and Guo et al. (2017) combine multiple metrics, each capturing a different aspect of the interaction, and predict conversation-level ratings. In particular, Venkatesh et al. (2017) show the importance of metrics such as coherence, conversational depth and topic diversity, while Guo et al. (2017) propose topic-based metrics. However, these studies require extensive manual annotation on top of conversation-level ratings.

In this work, we investigate non-task-based dialogue evaluation models trained without relying on any further annotations besides conversation-level user ratings. Our goal is twofold: investigating conversation features which characterize good interactions with a conversational agent, and exploring the feasibility of training a model able to predict user ratings in such a context.

In order to do so, we utilize a dataset of non-task-based spoken conversations between Amazon Alexa users and Roving Mind (Cervone et al., 2017), our open-domain system for the Amazon Alexa Prize Challenge 2017 (Ram et al., 2017). As an upper bound for the rating prediction task, we re-annotate a sample of the corpus using experts and analyse the correlation between expert and user ratings. Afterwards, we analyse the entire corpus using well-known automatically extractable features (user sentiment, Dialogue Acts of both user and machine, conversation length and average user turn length), which show a low, but still significant, correlation with user ratings. We then show how different combinations of these features, together with an LSA representation of the user turns, can be used to train a regression model whose predictions also yield a low, but significant, correlation with user ratings. Our results indicate the difficulty of predicting how users might rate interactions with a conversational agent.

2 Data Collection

The dataset analysed in this paper was collected over a period of 27 days during the Alexa Prize 2017 semifinals and consists of conversations between our system Roving Mind and Amazon Alexa users in the United States. The users could end the conversation whenever they wanted, using a command. At the end of the interaction, users were asked to rate the conversation on a 1 (not satisfied at all) to 5 (very satisfied) Likert scale.
Out of all the rated conversations, we selected the ones longer than 3 turns, yielding 4,967 conversations. Figure 1 shows the distribution (in percentages) of the ratings in our dataset. The large majority of conversations are between the system and "first-time" users, as only 5.25% of users had more than one conversation.

[Figure 1: bar chart of Conversation Volume (%) by Rating (1-5) for Users, Experts, and All ratings.]
Figure 1: Distribution of user and expert ratings on the annotated random sample of 100 conversations (test set) compared to the distribution of ratings in the entire dataset ("All ratings"). For clarity of presentation, from the latter we excluded the small portion of non-integer ratings (2.3% of the dataset).

3 Methodology

In this section we describe the conversation representation features, experimentation, and evaluation methodologies used in the paper.

3.1 Conversation Representation Features

Since in the competition the objective of the system was to entertain users, we expect the ratings to reflect how much users enjoyed the interaction. User "enjoyment" can be approximated using different metrics that do not require manual annotation, such as conversation length (in turns) and mean turn length (in words), assuming that the more users enjoy the conversation the longer they talk, and sentiment polarity, hypothesizing that enjoyable conversations should carry a more positive sentiment. While length metrics are straightforward to compute, the sentiment score is computed using a lexicon-based approach (Kennedy and Inkpen, 2006).

Another representation that could shed light on enjoyable conversations is the Dialogue Acts (DAs) of user and machine utterances. DAs are frequently used as a generic representation of intents, and the considered labels often include thanking, apologies, opinions, statements and the like. Relative frequencies of these tags can potentially be useful to distinguish good and bad conversations. The DA tagger we use is the one described in Mezza et al. (2018), trained with Support Vector Machines on the Switchboard Dialogue Acts corpus (Stolcke et al., 2000), a subset of Switchboard (Godfrey et al., 1992) annotated with DAs (42 categories). The user and machine DAs are considered as separate vectors and assessed both individually and jointly.

In addition to Dialogue Acts, sentiment and length features, we experiment with a word-based text representation. Latent Semantic Analysis (LSA) is used to convert a conversation to a vector. First, we construct a word-document co-occurrence matrix and normalize it. Then, we reduce the dimensionality to 100 by applying Singular Value Decomposition (SVD).
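As a concrete illustration of this step, the following is a minimal sketch of such an LSA pipeline built from standard scikit-learn components. It is not necessarily the exact implementation used here: it assumes each conversation is available as a single text string, uses TF-IDF weighting as the normalization step, and exposes the n-gram range as a parameter (Section 6 compares ranges from 1 to 4).

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import make_pipeline

def build_lsa_vectors(conversations, n_components=100, ngram_range=(1, 4)):
    """Map each conversation (a text string) to a dense LSA vector."""
    # Word-document co-occurrence matrix with TF-IDF normalization
    vectorizer = TfidfVectorizer(ngram_range=ngram_range)
    # Dimensionality reduction via truncated SVD (the LSA step)
    svd = TruncatedSVD(n_components=n_components, random_state=0)
    lsa = make_pipeline(vectorizer, svd)
    return lsa.fit_transform(conversations)  # shape: (n_conversations, n_components)
```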
3.2 Correlation Analysis Methodology

The two correlation metrics we use are the widely adopted Pearson correlation coefficient (PCC) and Spearman's rank correlation coefficient (SRCC). While the former evaluates the linear relationship between two variables, the latter evaluates the monotonic one. These metrics are used to assess the correlations of different conversation features, such as sentiment score or conversation length, with the human ratings provided for those conversations, as well as to assess the correlation of the predicted scores of the regression models with those ratings. For the assessment of the correlation of both features and regression models, raw rating predictions are used.

3.3 Prediction Methodology

Using the conversation features described above, we train regression models to predict human ratings. We experiment with both Linear Regression and Support Vector Regression (SVR) with a radial basis function (RBF) kernel, using scikit-learn (Pedregosa et al., 2011). Since the latter consistently outperforms the former, we report only the results for the SVR. The performance of the regression models is evaluated using the standard metrics of Root Mean Squared Error (RMSE) and Mean Absolute Error (MAE). Additionally, we compute Pearson and Spearman's rank correlation coefficients for the predictions with respect to the reference human ratings.

We experiment in a 10-fold cross-validation setting. The performance of the regression models is compared to two baselines: (1) a mean baseline, where all instances in the testing fold are assigned as a score the mean of the training set ratings, and (2) a chance baseline, where an instance is randomly assigned a rating from 1 to 5 with respect to the rating distribution in the training set. The models are compared for statistical significance to these baselines using a paired two-tailed t-test with p < 0.05. In Section 6 we report average RMSE and MAE as well as average correlation coefficients.
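The sketch below illustrates how this setup can be assembled with scikit-learn and SciPy: an RBF-kernel SVR evaluated with 10-fold cross-validation, a mean baseline, RMSE, correlation of the pooled predictions with the reference ratings, and a paired two-tailed t-test on the per-fold errors. It is a simplified outline rather than our exact code: it assumes a feature matrix X and a rating vector y are already available as NumPy arrays, pairs the t-test over folds (one possible choice), and omits MAE and the chance baseline, which follow the same pattern.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr, ttest_rel
from sklearn.model_selection import KFold
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error

def evaluate_svr(X, y, n_splits=10, seed=0):
    """Cross-validate an RBF-kernel SVR and compare it to a mean baseline."""
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    svr_rmse, base_rmse, preds, refs = [], [], [], []
    for train_idx, test_idx in kf.split(X):
        model = SVR(kernel="rbf").fit(X[train_idx], y[train_idx])
        y_pred = model.predict(X[test_idx])
        # Mean baseline: every test conversation gets the training-fold mean rating
        y_base = np.full(len(test_idx), y[train_idx].mean())
        svr_rmse.append(np.sqrt(mean_squared_error(y[test_idx], y_pred)))
        base_rmse.append(np.sqrt(mean_squared_error(y[test_idx], y_base)))
        preds.extend(y_pred)
        refs.extend(y[test_idx])
    pcc, _ = pearsonr(refs, preds)
    srcc, _ = spearmanr(refs, preds)
    # Paired two-tailed t-test: per-fold RMSE of the model vs. the baseline
    _, p_value = ttest_rel(svr_rmse, base_rmse)
    return np.mean(svr_rmse), pcc, srcc, p_value
```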
4 Upper bound

Since human ratings are inherently subjective, and different users can rate the same conversation differently, it is difficult to expect the models to yield perfect correlations or very low RMSE and MAE. In order to test this hypothesis, two human experts (members of our Alexa Prize team) were asked to rate a random subset of the corpus (100 conversations). The rating distributions for both experts and users on this sample are reported in Figure 1. We observe that expert ratings tend to be closer to the middle of the Likert scale (i.e. from 2 to 4), while users had more conversations with ratings at both extremes of the scale (i.e. 1 and 5).

The RMSE, MAE and Pearson and Spearman's rank correlation coefficients of expert and user ratings are reported in Table 1. We observe that the experts tend to agree with each other more than they agree individually with users: compared to each other, the experts have the highest Pearson and Spearman correlation scores (0.705 and 0.694, respectively) and the lowest RMSE and MAE (0.875 and 0.660, respectively). The fact that expert ratings do not correlate with user ratings as well as they correlate among themselves confirms the difficulty of the task of predicting subjective user ratings, even for humans.

                 RMSE   MAE    PCC    SRCC
Exp 1 vs. Exp 2  0.875  0.660  0.705  0.694
Exp 1 vs. Users  1.225  0.966  0.538  0.526
Exp 2 vs. Users  1.286  1.016  0.401  0.370

Table 1: Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), Pearson (PCC) and Spearman's rank (SRCC) correlation coefficients among user and expert ratings.

5 Correlation Analysis Results

The results of the correlation analysis are reported in Table 2. From the table, we can observe that conversation length has a positive correlation with human judgment, while average user turn length has a negative correlation. The positive correlation with conversation length confirms the expectation that users tend to have longer conversations with the system when they enjoy it. The negative correlation with average user turn length, on the other hand, is unexpected. As expected, sentiment score has a significant positive correlation with human judgments.

Feature                  PCC       SRCC
Conversation Length      0.133**   0.111**
Av. User Turn Length    -0.068**  -0.079**
User Sentiment           0.071**   0.088**
User Dialogue Acts
  yes-answer             0.081**   0.088**
  appreciation           0.070**   0.115**
  thanking               0.062**   0.089**
  action-directive      -0.069**  -0.052**
  statement-non-opinion  0.050**   0.037**
  ...
Machine Dialogue Acts
  yes-no-question        0.042**   0.038**
  statement-opinion     -0.027**  -0.032**
  ...

Table 2: Pearson (PCC) and Spearman's rank (SRCC) correlation coefficients for conversation lengths, sentiment score, and user and machine Dialogue Acts. Correlations significant with p < 0.05 are marked with * and p < 0.01 with **.

Due to space considerations, we report only a portion of the DAs that have significant correlations with human ratings. The analysis confirms our expectation that user DAs such as thanking and appreciation have significant positive correlations. We also observe that the action-directive DA has a negative correlation. Since this DA label covers the turns where a user issues control commands to the system, we hypothesize that this correlation could be due to users taking a task-based approach with our system, which was instead designed for chitchat, and therefore feeling disappointed (e.g. requesting the Roving Mind system to perform actions it was not designed to perform, such as playing music).

Regarding machine DAs, we observe that even though some DAs exhibit significant correlations, overall these are lower than for user DAs. In particular, yes-no-question has a significant positive correlation with human judgments, indicating that some users appreciate machine initiative in the conversation. The analysis confirms the utility of length and sentiment features, as well as the importance of some DAs (generic intents), for estimating user ratings.
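The feature-level correlations reported in Table 2 can be reproduced in outline with SciPy. The sketch below is illustrative only: the feature names and values in the commented call are hypothetical placeholders, and it assumes each feature has already been reduced to one number per conversation (e.g. the relative frequency of a Dialogue Act, or the conversation length in turns).

```python
from scipy.stats import pearsonr, spearmanr

def feature_rating_correlations(features, ratings):
    """Pearson/Spearman correlations (with p-values) of per-conversation features vs. user ratings.

    features: dict mapping a feature name to a list with one value per conversation.
    ratings:  list of conversation-level user ratings (1-5), in the same order.
    """
    results = {}
    for name, values in features.items():
        pcc, p_pcc = pearsonr(values, ratings)
        srcc, p_srcc = spearmanr(values, ratings)
        results[name] = (pcc, p_pcc, srcc, p_srcc)
    return results

# Illustrative call with made-up values for three conversations:
# feature_rating_correlations(
#     {"conversation_length": [12, 5, 30], "user_sentiment": [0.2, -0.1, 0.5]},
#     [4, 2, 5])
```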
6 Prediction Results

The results of the experiments using 10-fold cross-validation and Support Vector Regression are reported in Table 3. We report the performance of each feature representation in isolation, as well as of their combinations. We consider two baselines: chance and mean. For the chance baseline, an instance is randomly assigned a rating with respect to the training set distribution. For the mean baseline, on the other hand, all instances are assigned the mean of the training set as a rating. The mean baseline yields better RMSE and MAE scores; consequently, we compare the regression models to it.

                  RMSE    MAE     PCC      SRCC
BL: Chance        1.967*  1.535*  0.007**  0.023**
BL: Mean          1.382*  1.189*  N/A      N/A
Lengths           1.400*  1.116*  0.153**  0.158**
Sentiment         1.423*  1.128*  0.109**  0.122**
DA: user          1.378*  1.106*  0.213**  0.207**
DA: machine       1.418*  1.129*  0.104**  0.099**
DA: user+machine  1.375*  1.106*  0.219**  0.211**
LSA               1.350*  1.075*  0.299**  0.288**
All - LSA         1.366*  1.100*  0.240**  0.230**
All               1.350*  1.078*  0.303**  0.290**

Table 3: 10-fold cross-validation average Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), Pearson (PCC) and Spearman's rank (SRCC) correlation coefficients for the regression models. RMSE and MAE significantly better than the baselines are marked with *. Correlations significant with p < 0.05 are marked with * and p < 0.01 with **.

Sentiment and length features (conversation length and average user turn length) both yield RMSE higher than the mean baseline and MAE significantly lower than it. Nonetheless, their predictions have significant positive correlations with the reference human ratings. The picture is similar for the models trained on user and machine DAs alone and on their combination: the RMSE scores are higher than, or insignificantly lower than, the mean baseline, while the MAE scores are significantly lower.

For the LSA representation of conversations we consider n-gram sizes between 1 and 4. The representation that considers 4-grams with an SVD dimension of 100 yields the best performance; thus, we report the performance of this model only, and use it for the feature combination experiments. The LSA model yields significantly lower error both in terms of RMSE and MAE. Additionally, the correlation of its predictions is higher than for the other features (and combinations).

The regression model trained on all features except LSA yields performance significantly better than the mean baseline, but inferior to that of LSA alone. The combination of all the features retains the best RMSE of the LSA model, but achieves a slightly worse MAE score. While it yields the best Pearson and Spearman's rank correlation coefficients among all the models, the difference from the LSA-only model is not statistically significant according to the Fisher r-to-z transformation.
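For reference, one common way to carry out such a comparison is the Fisher r-to-z test for the difference between two Pearson correlation coefficients; a minimal sketch of the standard independent-samples form is given below. This is an illustration rather than our exact procedure (variants for dependent, same-sample correlations also exist), and the sample sizes in the commented call are approximations based on the 4,967 conversations in our dataset.

```python
import numpy as np
from scipy.stats import norm

def fisher_r_to_z_test(r1, n1, r2, n2):
    """Two-tailed p-value for the difference between two Pearson correlations."""
    z1, z2 = np.arctanh(r1), np.arctanh(r2)        # Fisher r-to-z transformation
    se = np.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))  # standard error of the difference
    z = (z1 - z2) / se
    return 2.0 * (1.0 - norm.cdf(abs(z)))

# e.g. comparing the LSA-only and all-features correlations from Table 3:
# p = fisher_r_to_z_test(0.299, 4967, 0.303, 4967)
```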
Pretten- ation measures for machine translation and/or sum- hofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Pas- marization, volume 29, pages 65–72. sos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine learn- Alessandra Cervone, Giuliano Tortoreto, Stefano ing in Python. Journal of Machine Learning Re- Mezza, Enrico Gambi, and Giuseppe Riccardi. search. 2017. Roving mind: a balancing act between open– domain and engaging dialogue systems. In Alexa Ashwin Ram, Rohit Prasad, Chandra Khatri, Anu Prize Proceedings. Venkatesh, Raefer Gabriel, Qing Liu, Jeff Nunn, Behnam Hedayatnia, Ming Cheng, Ashish Nagar, Morena Danieli and Elisabetta Gerbino. 1995. Metrics Eric King, Kate Bland, Amanda Wartick, Yi Pan, for evaluating dialogue strategies in a spoken lan- Han Song, Sk Jayadevan, Gene Hwang, and Art Pet- guage system. In Proceedings of the 1995 AAAI tigrue. 2017. Conversational ai: The science behind spring symposium on Empirical Methods in Dis- the alexa prize. In Alexa Prize Proceedings. course Interpretation and Generation, volume 16, pages 34–39. Andreas Stolcke, Klaus Ries, Noah Coccaro, Eliza- beth Shriberg, Rebecca Bates, Daniel Jurafsky, Paul John J Godfrey, Edward C Holliman, and Jane Mc- Taylor, Rachel Martin, Carol Van Ess-Dykema, and Daniel. 1992. Switchboard: Telephone speech cor- Marie Meteer. 2000. Dialogue act modeling for pus for research and development. In Acoustics, automatic tagging and recognition of conversational Speech, and Signal Processing, 1992. ICASSP-92., speech. Computational linguistics, 26(3):339–373. 1992 IEEE International Conference on, volume 1, pages 517–520. IEEE. Anu Venkatesh, Chandra Khatri, Ashwin Ram, Fenfei Guo, Raefer Gabriel, Ashish Nagar, Rohit Prasad, Fenfei Guo, Angeliki Metallinou, Chandra Khatri, Ming Cheng, Benham Hedayatnia, Angeliki Met- Anirudh Raju, Anu Venkatesh, and Ashwin Ram. allinou, Rahul Goel, Shaohua Yang, and Anirudh 2017. Topic-based evaluation for conversational Raju. 2017. On evaluating and comparing con- bots. In NIPS 2017 Conversational AI workshop. versational agents. In NIPS 2017 Conversational AI Alistair Kennedy and Diana Inkpen. 2006. Senti- workshop. ment classification of movie reviews using contex- Marilyn A Walker, Diane J Litman, Candace A Kamm, tual valence shifters. Computational intelligence, and Alicia Abella. 1997. Paradise: A framework 22(2):110–125. for evaluating spoken dialogue agents. In Proceed- Chia-Wei Liu, Ryan Lowe, Iulian Serban, Mike Nose- ings of the eighth conference on European chap- worthy, Laurent Charlin, and Joelle Pineau. 2016. ter of the Association for Computational Linguistics, How not to evaluate your dialogue system: An em- pages 271–280. Association for Computational Lin- pirical study of unsupervised evaluation metrics for guistics. dialogue response generation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2122–2132. Ryan Lowe, Michael Noseworthy, Iulian Vlad Ser- ban, Nicolas Angelard-Gontier, Yoshua Bengio, and Joelle Pineau. 2017. Towards an automatic turing test: Learning to evaluate dialogue responses. In Proceedings of the 55th Annual Meeting of the As- sociation for Computational Linguistics, volume 1, pages 1116–1126. Stefano Mezza, Alessandra Cervone, Giuliano Tor- toreto, Evgeny A. Stepanov, and Giuseppe Riccardi. 2018. Iso-standard domain-independent dialogue act tagging for conversational agents. 
Anu Venkatesh, Chandra Khatri, Ashwin Ram, Fenfei Guo, Raefer Gabriel, Ashish Nagar, Rohit Prasad, Ming Cheng, Behnam Hedayatnia, Angeliki Metallinou, Rahul Goel, Shaohua Yang, and Anirudh Raju. 2017. On evaluating and comparing conversational agents. In NIPS 2017 Conversational AI Workshop.

Marilyn A. Walker, Diane J. Litman, Candace A. Kamm, and Alicia Abella. 1997. PARADISE: A framework for evaluating spoken dialogue agents. In Proceedings of the Eighth Conference of the European Chapter of the Association for Computational Linguistics, pages 271–280. Association for Computational Linguistics.