   Automatically Predicting User Ratings for Conversational Systems

       A. Cervone¹, E. Gambi¹, G. Tortoreto², E.A. Stepanov², G. Riccardi¹
        ¹ Signals and Interactive Systems Lab, University of Trento, Trento, Italy
                                 ² VUI, Inc., Trento, Italy
  {alessandra.cervone, enrico.gambi, giuseppe.riccardi}@unitn.it, {eas,gtr}@vui.com


                                  Abstract

English. Automatic evaluation models for open-domain conversational agents either correlate poorly with human judgment or require expensive annotations on top of conversation scores. In this work we investigate the feasibility of learning evaluation models without relying on any further annotations besides conversation-level human ratings. We use a dataset of rated (1-5) open-domain spoken conversations between the conversational agent Roving Mind (competing in the Amazon Alexa Prize Challenge 2017) and Amazon Alexa users. First, we assess the complexity of the task by asking two experts to re-annotate a sample of the dataset, and observe that the subjectivity of user ratings yields a low upper bound. Second, through an analysis of the entire dataset we show that automatically extracted features such as user sentiment, Dialogue Acts and conversation length have significant, but low correlation with user ratings. Finally, we report the results of our experiments exploring different combinations of these features to train automatic dialogue evaluation models. Our work suggests that predicting subjective user ratings in open-domain conversations is a challenging task.

Italiano. State-of-the-art models for the automatic evaluation of open-domain conversational agents either correlate poorly with human judgment or require expensive annotations in addition to the conversation-level score. In this work we investigate the possibility of learning evaluation models using only human ratings assigned to the whole conversation. The corpus used consists of open-domain spoken conversations between the conversational agent Roving Mind (part of the Amazon Alexa Prize 2017 competition) and Amazon Alexa users, rated on a scale from 1 to 5. First, we assess the complexity of the task by asking two experts to re-annotate a portion of the corpus, and we observe that it proves difficult even for human annotators, given its subjectivity. Second, through an analysis conducted on the entire corpus, we show that automatically extracted features (user sentiment, Dialogue Acts and conversation length) have a low, but significant, correlation with user ratings. Finally, we report the results of experiments exploring different combinations of these features to train automatic dialogue evaluation models. This work shows the difficulty of predicting subjective user ratings in conversations without a specific task.

1   Introduction

We are currently witnessing a proliferation of conversational agents in both industry and academia. Nevertheless, core questions regarding this technology remain to be addressed or analysed in greater depth. This work focuses on one such question: can we automatically predict user ratings of a dialogue with a conversational agent?

Metrics for task-based systems are generally related to the successful completion of the task. Among these, contextual appropriateness (Danieli and Gerbino, 1995) evaluates, for example, the degree of contextual coherence of machine turns with respect to user queries, which are classified with ternary values for slots (appropriate, inappropriate, and ambiguous). The approach is somewhat similar to the attribute-value matrix of the popular PARADISE dialogue evaluation framework (Walker et al., 1997), where matrices representing the information exchange requirements between the machine and the users towards solving the dialogue task serve as a measure of task success rate.
Unlike task-based systems, non-task-based conversational agents (also known as chitchat models) do not have a specific task to accomplish (e.g. booking a restaurant). Their goal can arguably be defined as the conversation itself, i.e. the entertainment of the human they are conversing with. Thus, human judgment is still the most reliable evaluation tool we have for such conversational agents. Collecting user ratings for a system, however, is expensive and time-consuming.

In order to deal with these issues, researchers have been investigating automatic metrics for non-task-based dialogue evaluation. The most popular of these metrics (e.g. BLEU (Papineni et al., 2002), METEOR (Banerjee and Lavie, 2005)) rely on surface text similarity (word overlaps) between machine and reference responses to the same utterances. Notwithstanding their popularity, such metrics are hardly compatible with the nature of human dialogue, since there could be multiple appropriate responses to the same utterance with no word overlap. Moreover, these metrics correlate weakly with human judgments (Liu et al., 2016).

Recently, a few studies proposed metrics with a better correlation with human judgment. ADEM (Lowe et al., 2017) is a model trained on appropriateness scores manually annotated at the response level. Venkatesh et al. (2017) and Guo et al. (2017) combine multiple metrics, each capturing a different aspect of the interaction, and predict conversation-level ratings. In particular, Venkatesh et al. (2017) show the importance of metrics such as coherence, conversational depth and topic diversity, while Guo et al. (2017) propose topic-based metrics. However, these studies require extensive manual annotation on top of conversation-level ratings.

In this work, we investigate non-task-based dialogue evaluation models trained without relying on any further annotations besides conversation-level user ratings. Our goal is twofold: investigating conversation features which characterize good interactions with a conversational agent, and exploring the feasibility of training a model able to predict user ratings in such a context.

In order to do so, we utilize a dataset of non-task-based spoken conversations between Amazon Alexa users and Roving Mind (Cervone et al., 2017), our open-domain system for the Amazon Alexa Prize Challenge 2017 (Ram et al., 2017). As an upper bound for the rating prediction task, we re-annotate a sample of the corpus using experts and analyse the correlation between expert and user ratings. Afterwards, we analyse the entire corpus using well-known automatically extractable features (user sentiment, Dialogue Acts (both user and machine), conversation length and average user turn length), which show a low, but still significant, correlation with user ratings. We show how different combinations of these features, together with an LSA representation of the user turns, can be used to train a regression model whose predictions also yield a low, but significant, correlation with user ratings. Our results indicate the difficulty of predicting how users might rate interactions with a conversational agent.

2   Data Collection

The dataset analysed in this paper was collected over a period of 27 days during the Alexa Prize 2017 semifinals and consists of conversations between our system Roving Mind and Amazon Alexa users in the United States. The users could end the conversation whenever they wanted, using a command. At the end of the interaction, users were asked to rate the conversation on a 1 (not satisfied at all) to 5 (very satisfied) Likert scale. Out of all the rated conversations, we selected the ones longer than 3 turns, yielding 4,967 conversations. Figure 1 shows the distribution (in percentages) of the ratings in our dataset. The large majority of conversations are between the system and “first-time” users, as only 5.25% of users had more than one conversation.

[Figure 1: bar chart of conversation volume (%) by rating (1-5), comparing user ratings, expert ratings, and all ratings.]

Figure 1: Distribution of user and expert ratings on the annotated random sample of 100 conversations (test set) compared to the distribution of ratings in the entire dataset (“All ratings”). For clarity of presentation, from the latter we excluded the small portion of non-integer ratings (2.3% of the dataset).

3   Methodology

In this section we describe the conversation representation features, the experimental setup, and the evaluation methodologies used in the paper.

3.1   Conversation Representation Features

Since in the competition the objective of the system was to entertain users, we expect the ratings to reflect how much they have enjoyed the interaction. User “enjoyment” can be approximated using different metrics that do not require manual annotation, such as conversation length (in turns) and mean turn length (in words), assuming that the more users enjoy the conversation the longer they talk, and sentiment polarity, hypothesizing that enjoyable conversations should carry a more positive sentiment. While the length metrics are straightforward to compute, the sentiment score is computed using a lexicon-based approach (Kennedy and Inkpen, 2006).
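As an illustration, these surface features can be computed along the following lines (a minimal Python sketch; the conversation format, the speaker labels and the polarity lexicon are assumptions made for the example, not the implementation used in the experiments):

from typing import Dict, List, Tuple

def surface_features(conversation: List[Tuple[str, str]],
                     lexicon: Dict[str, float]) -> Dict[str, float]:
    # conversation: list of (speaker, utterance) pairs, speaker in {"user", "machine"}
    # lexicon: word -> polarity score (e.g. +1/-1), standing in for a lexicon-based
    # sentiment resource in the spirit of Kennedy and Inkpen (2006)
    user_turns = [text for speaker, text in conversation if speaker == "user"]
    tokens = [tok.lower() for turn in user_turns for tok in turn.split()]
    return {
        # conversation length in turns (user and machine)
        "conversation_length": float(len(conversation)),
        # mean user turn length in words
        "avg_user_turn_length": sum(len(t.split()) for t in user_turns) / max(len(user_turns), 1),
        # mean polarity of user words found in the lexicon
        "user_sentiment": sum(lexicon.get(tok, 0.0) for tok in tokens) / max(len(tokens), 1),
    }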
Another representation that could shed light on enjoyable conversations is the Dialogue Acts (DAs) of user and machine utterances. DAs are frequently used as a generic representation of intents, and the considered labels often include thanking, apologies, opinions, statements and the like. Relative frequencies of these tags can potentially be useful to distinguish good from bad conversations. The DA tagger we use is the one described in Mezza et al. (2018), trained with Support Vector Machines on the Switchboard Dialogue Acts corpus (Stolcke et al., 2000), a subset of Switchboard (Godfrey et al., 1992) annotated with DAs (42 categories). The user and machine DAs are considered as separate vectors and assessed both individually and jointly.
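For instance, assuming the DA tagger has already labelled every turn, the DA portion of the representation can be built as relative tag frequencies (an illustrative sketch, not the actual feature extraction code):

from collections import Counter
from typing import Dict, List

def da_frequency_vector(turn_tags: List[str], tag_inventory: List[str]) -> Dict[str, float]:
    # turn_tags: one Switchboard-style DA label per turn (user and machine turns
    # are tagged and vectorized separately); tag_inventory: the 42 DA categories
    counts = Counter(turn_tags)
    total = max(len(turn_tags), 1)
    return {f"da_{tag}": counts[tag] / total for tag in tag_inventory}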

In addition to the Dialogue Act, sentiment and length features, we experiment with a word-based text representation. Latent Semantic Analysis (LSA) is used to convert a conversation into a vector. First, we construct a word-document co-occurrence matrix and normalize it. Then, we reduce the dimensionality to 100 by applying Singular Value Decomposition (SVD).
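A possible realisation of this representation with scikit-learn is sketched below (the TF-IDF normalisation and the treatment of a conversation as a single document are assumptions, since the exact choices are not fixed here):

from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

def lsa_vectors(conversations, ngram_max=4, dim=100):
    # conversations: list of strings, each the concatenated user turns of one dialogue
    lsa = make_pipeline(
        TfidfVectorizer(ngram_range=(1, ngram_max)),  # normalized n-gram/document matrix
        TruncatedSVD(n_components=dim),               # SVD reduction to `dim` dimensions
    )
    return lsa, lsa.fit_transform(conversations)      # one dim-dimensional vector per conversation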
3.2   Correlation Analysis Methodology

The two widely used correlation metrics are the Pearson correlation coefficient (PCC) and Spearman's rank correlation coefficient (SRCC). While the former evaluates the linear relationship between variables, the latter evaluates the monotonic one.

The metrics are used to assess the correlations of different conversation features, such as sentiment score or conversation length, with the human ratings provided for those conversations, as well as to assess the correlation of the predicted scores of the regression models with those ratings. For the assessment of the correlation of both features and regression models, raw rating predictions are used.
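Concretely, both coefficients can be obtained directly from SciPy, as in the following sketch (illustrative only):

from scipy.stats import pearsonr, spearmanr

def rating_correlations(values, ratings):
    # values: a feature (or a model's raw predicted ratings), one entry per conversation
    # ratings: the corresponding human ratings
    pcc, pcc_p = pearsonr(values, ratings)      # linear relationship
    srcc, srcc_p = spearmanr(values, ratings)   # monotonic relationship
    return {"PCC": pcc, "PCC_p": pcc_p, "SRCC": srcc, "SRCC_p": srcc_p}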
3.3   Prediction Methodology

Using the conversation features described above, we train regression models to predict human ratings. We experiment with both Linear Regression and Support Vector Regression (SVR) with a radial basis function (RBF) kernel, using scikit-learn (Pedregosa et al., 2011). Since the latter consistently outperforms the former, we report only the results for the SVR. The performance of the regression models is evaluated using the standard metrics of Root Mean Squared Error (RMSE) and Mean Absolute Error (MAE). Additionally, we compute the Pearson and Spearman's rank correlation coefficients of the predictions with respect to the reference human ratings.

We experiment in a 10-fold cross-validation setting. The performance of the regression models is compared to two baselines: (1) a mean baseline, where all instances in the testing fold are assigned the mean of the training set ratings as their score, and (2) a chance baseline, where each instance is randomly assigned a rating from 1 to 5 according to the rating distribution in the training set. The models are compared for statistical significance against these baselines using a paired two-tailed t-test with p < 0.05. In Section 6 we report average RMSE and MAE as well as average correlation coefficients.
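The evaluation loop can be sketched as follows (a simplified illustration with default SVR hyperparameters; X is the conversation-level feature matrix and y the vector of user ratings, both assumed to be NumPy arrays):

import numpy as np
from scipy.stats import pearsonr, spearmanr
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import KFold
from sklearn.svm import SVR

def cross_validate_svr(X, y, n_splits=10, seed=0):
    rng = np.random.default_rng(seed)
    fold_scores = []
    for train_idx, test_idx in KFold(n_splits=n_splits, shuffle=True, random_state=seed).split(X):
        pred = SVR(kernel="rbf").fit(X[train_idx], y[train_idx]).predict(X[test_idx])
        mean_bl = np.full(len(test_idx), y[train_idx].mean())     # mean baseline
        chance_bl = rng.choice(y[train_idx], size=len(test_idx))  # chance baseline (training distribution)
        fold_scores.append({
            "RMSE": mean_squared_error(y[test_idx], pred) ** 0.5,
            "MAE": mean_absolute_error(y[test_idx], pred),
            "PCC": pearsonr(pred, y[test_idx])[0],
            "SRCC": spearmanr(pred, y[test_idx])[0],
            "RMSE_mean_baseline": mean_squared_error(y[test_idx], mean_bl) ** 0.5,
            "RMSE_chance_baseline": mean_squared_error(y[test_idx], chance_bl) ** 0.5,
        })
    # average the metrics over the folds, as reported in Section 6
    return {k: float(np.mean([f[k] for f in fold_scores])) for k in fold_scores[0]}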
                     RMSE     MAE      PCC      SRCC
  Exp 1 vs. Exp 2    0.875    0.660    0.705    0.694
  Exp 1 vs. Users    1.225    0.966    0.538    0.526
  Exp 2 vs. Users    1.286    1.016    0.401    0.370

Table 1: Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), Pearson (PCC) and Spearman's rank (SRCC) correlation coefficients among user and expert ratings.

4   Upper bound

Since human ratings are inherently subjective, and different users can rate the same conversation differently, it is difficult to expect the models to yield perfect correlations or very low RMSE and MAE. In order to test this hypothesis, two human experts (members of our Alexa Prize team) were asked to rate a random subset of the corpus (100 conversations). The rating distributions for both experts and users on the sample are reported in Figure 1. We observe that expert ratings tend to be closer to the middle of the Likert scale (i.e. from 2 to 4), while users had more conversations with ratings at both extremes of the scale (i.e. 1 and 5).

The RMSE, MAE and Pearson and Spearman's rank correlation coefficients of expert and user ratings are reported in Table 1. We observe that the experts tend to agree with each other more than they agree individually with users: compared to each other, the experts have the highest Pearson and Spearman correlation scores (0.705 and 0.694, respectively) and the lowest RMSE and MAE (0.875 and 0.660, respectively). The fact that expert ratings do not correlate with user ratings as well as they correlate among themselves confirms the difficulty of the task of predicting subjective user ratings, even for humans.

5   Correlation Analysis Results

The results of the correlation analysis are reported in Table 2. From the table, we can observe that conversation length has a positive correlation with human judgment, while the average user turn length has a negative correlation. The positive correlation with conversation length confirms the expectation that users tend to have longer conversations with the system when they enjoy it. The negative correlation with average user turn length, on the other hand, is unexpected. As expected, sentiment score has a significant positive correlation with human judgments.

  Feature                      PCC          SRCC
  Conversation Length           0.133**      0.111**
  Av. User Turn Length         -0.068**     -0.079**
  User Sentiment                0.071**      0.088**
              User Dialogue Acts
  yes-answer                    0.081**      0.088**
  appreciation                  0.070**      0.115**
  thanking                      0.062**      0.089**
  action-directive             -0.069**     -0.052**
  statement-non-opinion         0.050**      0.037**
  ...
              Machine Dialogue Acts
  yes-no-question               0.042**      0.038**
  statement-opinion            -0.027**     -0.032**
  ...

Table 2: Pearson (PCC) and Spearman's rank (SRCC) correlation coefficients for conversation lengths, sentiment score, and user and machine Dialogue Acts. Correlations significant with p < 0.05 are marked with * and p < 0.01 with **.

Due to space considerations, we report only a portion of the DAs that have significant correlations with human ratings. The analysis confirms our expectations that user DAs such as thanking and appreciation have significant positive correlations. We also observe that the action-directive DA has a negative correlation. Since this DA label covers the turns where a user issues control commands to the system, we hypothesize that this correlation could be due to the fact that in such cases users were taking a task-based approach with a system which was instead designed for chitchat, and might therefore feel disappointed (e.g. requesting the Roving Mind system to perform actions it was not designed to perform, such as playing music).

Regarding machine DAs, we observe that even though some DAs exhibit significant correlations, overall they are lower than those of user DAs. In particular, yes-no-question has a significant positive correlation with human judgments, indicating that some users appreciate machine initiative in the conversation. The analysis confirms the utility of the length and sentiment features, as well as the importance of some DAs (generic intents), for estimating user ratings.

6   Prediction Results

The results of the experiments using 10-fold cross-validation and Support Vector Regression are reported in Table 3. We report the performance of each feature representation in isolation and of their combinations.
                              RMSE                  MAE                  PCC                  SRCC
 BL: Chance                      1.967*               1.535*               0.007**               0.023**
 BL: Mean                        1.382*               1.189*                   N/A                   N/A
 Lengths                         1.400*               1.116*               0.153**               0.158**
 Sentiment                       1.423*               1.128*               0.109**               0.122**
 DA: user                        1.378*               1.106*               0.213**               0.207**
 DA: machine                     1.418*               1.129*               0.104**               0.099**
 DA: user+machine                1.375*               1.106*               0.219**               0.211**
 LSA                             1.350*               1.075*               0.299**               0.288**
 All - LSA                       1.366*               1.100*               0.240**               0.230**
 All                             1.350*               1.078*               0.303**               0.290**

Table 3: 10 fold cross-validation average Root Mean Squared Error (RMSE), Mean Absolute Error
(MAE), Pearson (PCC) and Spearman’s rank (SRCC) correlation coefficients for regression models.
RMSE and MAE significantly better than the baselines are marked with *. Correlations significant with
p < 0.05 are marked with * and p < 0.01 with **.


We consider two baselines: chance and mean. For the chance baseline, an instance is randomly assigned a rating according to the training set distribution. For the mean baseline, on the other hand, all instances are assigned the mean rating of the training set. The mean baseline yields better RMSE and MAE scores; consequently, we compare the regression models to it.

Sentiment and length features (conversation and average user turn) both yield RMSE higher than the mean baseline and MAE significantly lower than it. Nonetheless, their predictions have significant positive correlations with the reference human ratings. The picture is similar for the models trained on user and machine DAs alone and on their combination: the RMSE scores are higher or insignificantly lower, and the MAE scores are significantly lower, than the mean baseline.

For the LSA representation of conversations we consider n-gram sizes between 1 and 4. The representation that considers 4-grams and an SVD dimension of 100 yields better performance; thus, we report the performance of this model only, and use it for the feature combination experiments. The LSA model yields significantly lower error both in terms of RMSE and MAE. Additionally, the correlation of its predictions is higher than for the other features (and combinations).

The regression model trained on all features but LSA yields performance significantly better than the mean baseline. However, it is inferior to that of LSA alone. The combination of all the features retains the best RMSE of the LSA model, but achieves a slightly worse MAE score. While it yields the best Pearson and Spearman's rank correlation coefficients among all the models, the difference from the LSA-only model is not statistically significant according to the Fisher r-to-z transformation.
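For reference, a standard way to compare two correlation coefficients is Fisher's r-to-z transformation with a normal approximation; the sketch below shows the common independent-samples form, which may differ in detail from the exact test applied here:

import math
from scipy.stats import norm

def compare_correlations(r1, n1, r2, n2):
    # Fisher r-to-z transformation of two Pearson correlations and a two-tailed
    # z-test on their difference (independent-samples approximation)
    z1, z2 = math.atanh(r1), math.atanh(r2)
    se = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    z = (z1 - z2) / se
    return z, 2.0 * (1.0 - norm.cdf(abs(z)))

# e.g. comparing the "All" (0.303) and "LSA" (0.299) coefficients over roughly
# 4,967 conversations gives p around 0.8, i.e. no significant difference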
7   Conclusions

In this work we experimented with a set of automatically extractable black-box features which correlate with the human perception of the quality of interactions with a conversational agent. Furthermore, we showed how these features can be combined to train automatic non-task-based dialogue evaluation models which correlate with human judgments without further expensive annotations.

The results of our experiments and analysis contribute to the body of observations indicating that much research is still needed to understand the characteristics of enjoyable conversations with open-domain, non-task-oriented agents. In particular, our analysis of expert vs. user ratings suggests that the task of estimating subjective user ratings is a difficult one, since the same conversation might be rated quite differently.

For future work, we plan to extend our corpus to include interactions with multiple conversational agents and task-based systems, as well as to explore other features that might be relevant for assessing human judgment of interactions with a conversational agent (e.g. emotion recognition).
References

Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, volume 29, pages 65–72.

Alessandra Cervone, Giuliano Tortoreto, Stefano Mezza, Enrico Gambi, and Giuseppe Riccardi. 2017. Roving Mind: a balancing act between open-domain and engaging dialogue systems. In Alexa Prize Proceedings.

Morena Danieli and Elisabetta Gerbino. 1995. Metrics for evaluating dialogue strategies in a spoken language system. In Proceedings of the 1995 AAAI Spring Symposium on Empirical Methods in Discourse Interpretation and Generation, volume 16, pages 34–39.

John J. Godfrey, Edward C. Holliman, and Jane McDaniel. 1992. Switchboard: Telephone speech corpus for research and development. In Acoustics, Speech, and Signal Processing, 1992. ICASSP-92., 1992 IEEE International Conference on, volume 1, pages 517–520. IEEE.

Fenfei Guo, Angeliki Metallinou, Chandra Khatri, Anirudh Raju, Anu Venkatesh, and Ashwin Ram. 2017. Topic-based evaluation for conversational bots. In NIPS 2017 Conversational AI Workshop.

Alistair Kennedy and Diana Inkpen. 2006. Sentiment classification of movie reviews using contextual valence shifters. Computational Intelligence, 22(2):110–125.

Chia-Wei Liu, Ryan Lowe, Iulian Serban, Mike Noseworthy, Laurent Charlin, and Joelle Pineau. 2016. How not to evaluate your dialogue system: An empirical study of unsupervised evaluation metrics for dialogue response generation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2122–2132.

Ryan Lowe, Michael Noseworthy, Iulian Vlad Serban, Nicolas Angelard-Gontier, Yoshua Bengio, and Joelle Pineau. 2017. Towards an automatic Turing test: Learning to evaluate dialogue responses. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, volume 1, pages 1116–1126.

Stefano Mezza, Alessandra Cervone, Giuliano Tortoreto, Evgeny A. Stepanov, and Giuseppe Riccardi. 2018. ISO-standard domain-independent dialogue act tagging for conversational agents. In Proceedings of COLING 2018, the 27th International Conference on Computational Linguistics: Technical Papers, pages 3539–3551.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pages 311–318. Association for Computational Linguistics.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research.

Ashwin Ram, Rohit Prasad, Chandra Khatri, Anu Venkatesh, Raefer Gabriel, Qing Liu, Jeff Nunn, Behnam Hedayatnia, Ming Cheng, Ashish Nagar, Eric King, Kate Bland, Amanda Wartick, Yi Pan, Han Song, Sk Jayadevan, Gene Hwang, and Art Pettigrue. 2017. Conversational AI: The science behind the Alexa Prize. In Alexa Prize Proceedings.

Andreas Stolcke, Klaus Ries, Noah Coccaro, Elizabeth Shriberg, Rebecca Bates, Daniel Jurafsky, Paul Taylor, Rachel Martin, Carol Van Ess-Dykema, and Marie Meteer. 2000. Dialogue act modeling for automatic tagging and recognition of conversational speech. Computational Linguistics, 26(3):339–373.

Anu Venkatesh, Chandra Khatri, Ashwin Ram, Fenfei Guo, Raefer Gabriel, Ashish Nagar, Rohit Prasad, Ming Cheng, Benham Hedayatnia, Angeliki Metallinou, Rahul Goel, Shaohua Yang, and Anirudh Raju. 2017. On evaluating and comparing conversational agents. In NIPS 2017 Conversational AI Workshop.

Marilyn A. Walker, Diane J. Litman, Candace A. Kamm, and Alicia Abella. 1997. PARADISE: A framework for evaluating spoken dialogue agents. In Proceedings of the Eighth Conference on European Chapter of the Association for Computational Linguistics, pages 271–280. Association for Computational Linguistics.