End-to-End Learning for Conversational Recommendation: A Long Way to Go?

Dietmar Jannach, University of Klagenfurt, Austria, dietmar.jannach@aau.at
Ahtsham Manzoor, University of Klagenfurt, Austria, ahtsham.manzoor@aau.at

ABSTRACT
Conversational Recommender Systems (CRS) have received increased interest in recent years due to advances in natural language processing and the wider use of voice-controlled smart assistants. One technical approach to build such systems is to learn, in an end-to-end way, from recorded dialogs between humans. Recent proposals rely on neural architectures for learning such models. These models are often evaluated both with the help of computational metrics and with the help of human annotators. In the latter case, the task of the human judges may consist of assessing the utterances generated by the models, e.g., in terms of their consistency with previous dialog utterances.
However, such assessments may not tell us enough about the true usefulness of the resulting recommendation model, in particular when the judges only assess how good one model is compared to another. In this work, we therefore analyze the utterances generated by two recent end-to-end learning approaches for CRS on an absolute scale. Our initial analysis reveals that for each system about one third of the system utterances are not meaningful in the given context and would probably lead to a broken conversation. Furthermore, less than two thirds of the recommendations were considered to be meaningful. Interestingly, neither of the two systems truly "generated" utterances, as almost all system responses were already present in the training data. Overall, our work shows that (i) current approaches that are published at high-quality research outlets may have severe limitations regarding their usability in practice and (ii) our academic evaluation approaches for CRS should be reconsidered.

CCS CONCEPTS
• Information systems → Recommender systems

KEYWORDS
Conversational Recommender Systems; Evaluation.

ACM Reference Format:
Dietmar Jannach and Ahtsham Manzoor. 2020. End-to-End Learning for Conversational Recommendation: A Long Way to Go?. In IntRS Workshop at ACM RecSys '20, September 2020, Online. ACM, New York, NY, USA, 5 pages.

1 INTRODUCTION
In Conversational Recommender Systems (CRS), a software agent interacts with users in a multi-turn dialog with the goal of supporting them in finding items that match their preferences [9]. From an interaction perspective, such systems therefore go beyond the one-shot recommendation paradigm of typical recommenders that can be found, e.g., on e-commerce sites, as they try to elicit the user's specific preferences in an interactive conversation. Such CRS have been a focus of researchers for more than two decades now, starting with early critiquing approaches in the mid-1990s [7]. Since then, a number of alternative technical approaches have been proposed, ranging from more elaborate critiquing techniques [3], over knowledge-based advisory systems [8], to learning-based systems that rely, e.g., on reinforcement learning techniques [5, 15].
In recent years, CRS have gained increased research interest again, mostly due to technological progress in natural language processing (NLP) and the widespread use of voice-controlled smart assistants such as Apple's Siri, Amazon's Alexa, or Google Home. Differently from many previous works that are based on substantial amounts of statically defined domain knowledge (e.g., about item properties or pre-defined dialog states and transitions), some recent approaches try to adopt an end-to-end learning approach [4, 11-13]. Informally speaking, the main task in such an approach is to learn a machine learning model given a set of recommendation dialogs that were held between humans. One promise of such solutions is that the amount of knowledge engineering can be kept low and that such a system should continuously improve as more dialogs become available.
The evaluation of the usefulness of such CRS in academic environments is, however, generally challenging. To assess the quality of a system, it is not only important to check if the recommendations are adequate in a given dialog situation; one also has to assess the quality of the dialog itself. Dialog quality could, for example, relate to the question of whether a system is able to react to chit-chat utterances (phatic expressions) in an appropriate way.
In some recent works, researchers use a combination of objective and subjective measures to assess a CRS. Objective measures can, for example, be typical recommendation accuracy measures, but may also include linguistic measures like perplexity that capture the fluency of the natural language [4]. Subjective evaluations sometimes include judgments by independent human evaluators. In [4], for example, the task of the evaluators was to rate the consistency of a system-generated utterance at a given dialog state on an absolute scale. In [12], a ranking of different alternatives was requested from the annotators. Such a form of human evaluation, however, has some limitations. In the case of relative comparisons, we do not know if any of the compared systems are useful at all. In the case of an absolute scale, the KBRD system from [4], for example, reached a consistency score of about 2 on a scale from one to three. This, however, cannot fully inform us about how useful such systems are in practice. In a deployed application, users might, for example, not tolerate too many conversation breakdowns, where the system is incapable of responding in a reasonable way.
In this work, we therefore analyze quality aspects of two state-of-the-art end-to-end learning systems ([4, 12]) in a complementary way. Specifically, we ask human evaluators to assess, using a binary scale, whether a given response by the system appears meaningful to them or not. Examples of non-meaningful utterances would be a repetition of what was previously said or a system utterance that does not match the context of the dialog. In addition, we asked the evaluators to judge every item recommendation in a subjective way.
Our analysis shows that about one third of the utterances generated by the investigated systems are considered not meaningful in the given dialog context. Moreover, for both systems, more than one third of the recommendations did not suit the assumed preferences of the recommendation seeker. These observations raise questions both regarding the practical usefulness of the proposed systems and the way we evaluate CRS in academia. A main implication of our analyses is that more realistic ways of evaluating CRS are needed. In particular, such evaluation approaches should help us understand whether a proposed end-to-end learning system (i) reaches a quality level, in terms of generating plausible responses, that is actually acceptable for users and (ii) is able to avoid bad recommendations that can be detrimental to the system's quality perception and use [2].

2 ANALYZED APPROACHES
Technical Approaches – DeepCRS and KBRD. We analyzed two recent approaches from the literature. The first one, which we denote as DeepCRS, was published at NeurIPS 2018 [12]. Its architecture consists of four sub-components, which accomplish different tasks such as sentence encoding, next-utterance prediction, sentiment classification, and recommendation. Technically, the architecture is inspired by the hierarchical HRED architecture from [16] and is based on RNNs and an autoencoder for the recommendation task. The second approach is called KBRD (Knowledge-Based Recommender Dialog System) [4] and was published at EMNLP-IJCNLP '19. The system's components include a Transformer-based sequence-to-sequence module as a dialog system, a knowledge graph that captures dialog-external knowledge about the domain (movies), and a switching network that connects the dialog and knowledge modules.

Underlying Data – The ReDial Dataset. Both approaches, DeepCRS and KBRD, are trained on the ReDial dataset (https://redialdata.github.io/website/). This dataset was collected by the authors of DeepCRS with the help of crowdworkers and consists of more than 10,000 conversations between a recommendation seeker and a recommender. The crowdworkers were given specific instructions regarding the conversation: For example, each participant had to take one of two roles, seeker or recommender. The seeker had to specify which movies she likes, and the recommender's task was to make recommendations based on the assumed interests of the seeker. The conversations were collected through a web-based interface, where the crowdworkers typed their utterances in natural language. At least four movie mentions were required per dialog, and each dialog had to have at least ten utterances. The resulting conversations were then further enriched. Movie mentions were tagged with movie names and release years. Furthermore, different labels were assigned to the movies, e.g., whether or not a dialog seeker has seen or liked them.
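To illustrate the kind of training signal such dialogs provide, the following is a simplified, hypothetical sketch of how a single ReDial-style conversation could be represented. The field names and the @-prefixed movie identifiers are our own illustration and do not reflect the dataset's actual schema.

```python
# Hypothetical, simplified representation of one ReDial-style dialog.
# Field names are illustrative only; the real dataset uses its own schema.
example_dialog = {
    "messages": [
        {"role": "seeker", "text": "I loved @111776 (The Martian). Any similar suggestions?"},
        {"role": "recommender", "text": "You might enjoy @204334 (Interstellar)."},
        {"role": "seeker", "text": "Great, I have not seen that one yet. Thanks!"},
    ],
    # Movie mentions are tagged with names and release years.
    "movie_mentions": {
        "@111776": {"name": "The Martian", "year": 2015},
        "@204334": {"name": "Interstellar", "year": 2014},
    },
    # Per-movie labels collected from the crowdworkers, e.g., seen/liked flags.
    "seeker_labels": {
        "@111776": {"seen": True, "liked": True},
        "@204334": {"seen": False, "liked": None},
    },
}
```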
Original Evaluation. The DeepCRS system was evaluated in different dimensions, including the quality of the sentiment classification, recommendation quality, and overall dialog assessment. Accuracy was measured in terms of Cohen's kappa coefficient [6] given the like/dislike labels in the ReDial dataset, which, however, has a very skewed distribution with over 90% like statements.
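As a reminder of what this metric measures, Cohen's kappa corrects the observed agreement between two label assignments for the agreement expected by chance, which is exactly why it is more informative than plain accuracy on such a skewed label distribution:

```latex
\kappa = \frac{p_o - p_e}{1 - p_e}
```

Here, $p_o$ is the observed proportion of agreement (between predicted and ground-truth like/dislike labels) and $p_e$ is the agreement expected by chance from the marginal label frequencies. With over 90% "like" statements, $p_e$ is already high, so a classifier must clearly beat the majority-label baseline to obtain a substantial $\kappa$.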
The more interesting part was the human evaluation. Here, ten participants of a user study were given ten dialogs from the ReDial dataset that contained 56 system-generated utterances. The task of the participants was to rank each utterance compared to (a) the true utterance in the original dialog and (b) the utterance generated by the HRED model [16], which was used as a baseline. The results of the ranking exercise showed that the human recommendations were most often considered to be the best ones and that the proposed DeepCRS model was better than HRED. An example of different responses in a given situation is shown in Table 1.

Table 1: Example of Conversation with Alternative Recommendation Utterances used for Evaluation [12].

  ...
  SEEKER:  2001: a space odyssey might be a great option. anything else that you would suggest?
  HUMAN:   you can try planet of the apes the older one is quite suspenseful and family friendly.
  HRED:    i haven't seen that one yet but i heard it was good. i haven't seen that one. have you seen the last house on the left?
  DeepCRS: star wars: the force awakens is also a good one return of the jedi all good movies

The KBRD approach was evaluated in three dimensions. Two computational metrics, perplexity and distinct n-grams, measure the fluency and diversity of the natural language. Recommendation quality was measured in terms of Recall. KBRD proved to be favorable over two baselines (DeepCRS and a Transformer model) in terms of all computational metrics. For the human evaluation, ten annotators with knowledge in linguistics were asked to assess the consistency of a generated utterance with the previous dialog history on a scale from 1 to 3. An average consistency rating of 1.99 was obtained, which was about 15% higher than the average rating for the DeepCRS baseline.
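For reference, these two computational metrics are commonly defined as follows; we sketch them here in their standard form, which may differ in detail from the exact variants used in [4]:

```latex
\mathrm{PPL} = \exp\Big(-\frac{1}{N}\sum_{i=1}^{N} \log p(w_i \mid w_{<i})\Big),
\qquad
\mathrm{distinct}\text{-}n = \frac{\#\ \text{unique } n\text{-grams in the generated responses}}{\#\ \text{generated } n\text{-grams}}
```

Lower perplexity indicates that the model assigns higher probability to the observed word sequences (fluency), while higher distinct-n values indicate a more diverse set of generated n-grams.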
Overall, the (relative) ranking exercise for DeepCRS unfortunately does not tell us much about the absolute meaningfulness and usefulness of the system-generated utterances. For KBRD, the evaluation was done on an absolute scale. The average score was at about 2 on the 1-3 scale, i.e., in the middle. No details are, however, provided regarding the distribution of the ratings. It is, for example, not clear if trivial system responses like a goodbye to a seeker's goodbye were counted as consistent dialog continuations.
Given the difficulty of assessing the usefulness of the proposed systems from the reported studies, our goal was to assess the quality of the system responses through a complementary analysis.

3 ANALYSIS METHODOLOGY
Looking at example conversations published in the supplementary material of [12] (https://papers.nips.cc/paper/8180-towards-deep-conversational-recommendations), we found that even in these hand-selected examples many system responses (labeled as OURS) were not meaningful. One main goal of our analysis was to quantify the extent of such problems. Our analysis procedure was as follows.
First, we randomly selected 70 dialogs from the ReDial test dataset. We then used the code provided by the authors of DeepCRS and KBRD to reproduce the systems and to generate system responses after each seeker utterance, given the dialog history up to that point. As a result, we obtained 70 dialogs, which not only contained the original seeker and human recommender utterances, but also the recommender sentences that were generated by the respective CRS.
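The following is a minimal sketch of this response-collection procedure. The model wrappers (`DeepCRSModel`, `KBRDModel`) and the dialog fields are hypothetical placeholders for the authors' published code and data loaders, not real APIs.

```python
def collect_system_responses(model, dialogs):
    """For every seeker utterance, generate a system response given the dialog
    history up to that point, and keep it next to the original human turns."""
    augmented = []
    for dialog in dialogs:
        history, turns = [], []
        for turn in dialog["messages"]:
            turns.append(turn)
            history.append(turn["text"])
            if turn["role"] == "seeker":
                # The model only sees the dialog up to the current seeker utterance.
                system_response = model.respond(history)
                turns.append({"role": "system", "text": system_response})
        augmented.append({"messages": turns})
    return augmented

# Usage sketch (hypothetical wrappers around the published code):
#   dialogs = random.sample(load_redial_test_dialogs(), 70)
#   deepcrs_dialogs = collect_system_responses(DeepCRSModel(), dialogs)
#   kbrd_dialogs = collect_system_responses(KBRDModel(), dialogs)
```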
In total, 758 system responses, 399 by DeepCRS and 359 by KBRD, were generated this way. We analyzed the responses through both manual and automated processes. For replicability, we share all study materials online (https://drive.google.com/drive/folders/10gPOmaiFrZjIULIa3LsdmuyvJvnCV_Xq). The following main analyses were made:
(1) Creativeness or novelty of the responses with respect to the training data;
(2) Meaningfulness of the system responses in the given context;
(3) Quality of the recommendations.
In the context of analysis (1), we were wondering how different the system-generated responses are from the training data. This is particularly relevant as the authors of KBRD measure perplexity and n-gram distance for the generated sentences. In our analysis, we therefore counted which fraction of the system responses was contained in an identical or almost identical form in the training data. We considered sentences to be almost identical if the same set of words appeared in them with the same frequency, i.e., only the word order was changed. In case the generated sentences are mostly identical to sentences appearing in the training data, measuring the perplexity and the n-gram distance of what are mostly genuine human utterances is not very informative. A simple check for such near-duplicates is sketched below.
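The following is a minimal sketch of how this identical / almost-identical classification can be implemented, assuming the training utterances and generated responses are available as plain strings; tokenization details (e.g., handling of movie-name placeholders) are simplified.

```python
from collections import Counter

def normalize(sentence):
    """Lowercase and tokenize on whitespace; real preprocessing would also strip
    punctuation and replace movie mentions with a common placeholder token."""
    return sentence.lower().split()

def classify_response(response, training_sentences):
    """Classify a generated response as 'identical', 'almost identical'
    (same word multiset, different order), or 'new' wrt. the training data."""
    tokens = normalize(response)
    bag = Counter(tokens)
    train_token_lists = [normalize(s) for s in training_sentences]
    if any(tokens == t for t in train_token_lists):
        return "identical"
    if any(bag == Counter(t) for t in train_token_lists):
        return "almost identical"
    return "new"

# Example: word order differs, but word frequencies are the same.
print(classify_response(
    "have you seen the movie ?",
    ["you have seen the movie ?", "i love horror movies"],
))  # -> 'almost identical'
```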
To measure aspects (2) and (3), we relied on human annotators who marked each generated system response in the 70 dialogs as being meaningful or not. Furthermore, we asked them to label each utterance as being chit-chat or containing a recommendation. Two of the three annotators were PhD students at two universities with no background in conversational recommendation. One of them evaluated the DeepCRS responses, the other annotated the KBRD responses. They were not informed about the background of their task. To obtain a second opinion and to avoid potential biases, both sets of responses were also manually labeled by one of the authors of this paper. The annotator agreement was generally very high (92.73% for DeepCRS and 93.89% for KBRD).
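These values presumably correspond to simple percent agreement over the binary meaningfulness labels; a minimal sketch of that computation (with made-up label lists) is shown below. Chance-corrected measures such as Cohen's kappa (see Section 2) could be reported in addition.

```python
def percent_agreement(labels_a, labels_b):
    """Fraction of items on which two annotators assigned the same binary label."""
    assert len(labels_a) == len(labels_b)
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return 100.0 * matches / len(labels_a)

# Toy example with invented labels (1 = meaningful, 0 = not meaningful).
external_annotator = [1, 1, 0, 1, 0, 1, 1, 0]
author_annotator   = [1, 1, 0, 0, 0, 1, 1, 0]
print(f"{percent_agreement(external_annotator, author_annotator):.2f}%")  # 87.50%
```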
When instructing the external annotators, we did not provide a specific definition of what "meaningful" means. Our analysis shows that typical non-meaningful sentences include situations where the system ignores the intent of the last user utterance, repeats questions, abruptly ends the conversation, provides a broken or incomplete response, or makes a bad recommendation. The judgment of what represents a bad recommendation is in many cases clear, e.g., when the system recommends a movie that the seeker has just mentioned, but to some extent it remains a subjective assessment.

4 RESULTS
Analysis of Generated Sentences. Table 2 shows the characteristics of the utterances that were generated by DeepCRS and KBRD. Even though both algorithms were fed with the same dialogs to complete, the number of generated sentences for KBRD is lower, as this method did not always return a response.

Table 2: Characteristics of Generated Sentences

                                      DeepCRS   KBRD
  Generated Sentences                     399    359
  Unique Sentences                         46    159
  Identical in Training Data               44     87
  Almost Identical in Training Data         2     59
  New Sentences                             0      5
  Broken Sentences                          0     11

Regarding the novelty of the returned sentences, we found, to some surprise, that DeepCRS almost exclusively returns sentences that appear in identical form in the training data (except for the placeholders for movie names, which are eventually replaced by the algorithm). KBRD also mostly returns sentences that are contained in the training data or are tiny modifications of such sentences. Five of the generated sentences were not in the training data. However, there were also 11 generated sentences that were broken.
Overall, it is surprising that both systems mostly return sentences they found in the training data, which resembles a retrieval approach more than a language generation approach. Measuring linguistic properties, e.g., perplexity on the sentence level, of what are genuinely human sentences is therefore not too meaningful.

Analysis of Dialog and Recommendation Quality. In Table 3, we show the results of the labeling process by the annotators. The numbers in the table correspond to the rounded average of the two annotators who, as mentioned above, have a very high agreement.

Table 3: Analysis of Dialog and Recommendation Quality

                                              DeepCRS        KBRD
  Number of dialogs                                70          70
  Generated sentences (overall)                   399         359
  Sentences labeled as meaningful           277 (69%)   209 (58%)
  Sentences labeled as not meaningful       122 (31%)   150 (42%)
  Dialogs without problems                          5           5
  Chit-chat sentences                             132          88
  Chit-chat labeled as meaningful           112 (85%)    77 (87%)
  Number of recommendations                       106         119
  Recommendations labeled as meaningful      63 (60%)    66 (55%)
  Dialogs with no meaningful recommendation  25 (36%)    20 (28%)
  Dialogs with no recommendation made         7 (10%)    6 (8.5%)

The results show that for both systems a substantial fraction of the generated responses (about 40%) were considered not meaningful by the annotators. As a result, there are only 5 (7%) dialogs for which there is not at least one issue. Overall, these findings raise the question whether such high failure rates would be acceptable to users in practice.
A major fraction of the generated sentences (33% for DeepCRS, 24% for KBRD) were considered chit-chat. The analyzed systems performed better when generating such chit-chat messages than when generating other types of utterances. The percentage of meaningful chit-chat responses is 85% and 87% for DeepCRS and KBRD, respectively. However, a large fraction of these chit-chat exchanges consists of trivial responses to 'hello', 'hi', 'goodbye', and 'thank you' utterances by the recommendation seeker. Overall, the chit-chat messages account for 39% of all generated sentences that were marked as meaningful.
The analysis of the quality of the recommendations themselves is what we thought would be a more subjective part of our evaluation. The agreement between the annotators was, however, very high (93% for DeepCRS and 92% for KBRD). The annotators relied both on their own expertise in the movie domain and on external sources like movie databases such as IMDb to check the plausibility of the recommendations. Recommendations were typically considered not meaningful when the annotators could not establish any plausible link between the seeker's preferences and the recommendation. An example is the system's recommendation of the movie "The Secret Life of Pets" after the seeker mentioned that s/he liked "Avengers: Infinity War". Quite interestingly, the subjective performance of KBRD is lower than that of DeepCRS, even though KBRD includes a knowledge graph, DBpedia (https://wiki.dbpedia.org/), that contains information about movies and their relationships.
Overall, the results in Table 3 show that the perceived recommendation quality is modest. For both the DeepCRS and the KBRD system, less than two thirds of the movie recommendations were considered meaningful. Moreover, DeepCRS did not produce even a single recommendation in 7 (10%) of the dialogs, and this was the case for 6 (8.5%) of the dialogs with KBRD.

Limitations of the ReDial Dataset. The existence of large-scale datasets containing human conversations is a key prerequisite for building a CRS based on end-to-end learning. The ReDial dataset is an important step in that direction. However, the dataset also has a number of limitations. These mostly have to do with the way it was created with the help of crowdworkers, who were given specific instructions about the minimum number of interactions and the minimum number of movie mentions.
As a result, many dialogs are not much longer than the minimum length, and the dialogs do not enter deeper discussions. The expression of preferences is very often based on movie mentions and only to a lesser extent based on preferences regarding certain features like genre or directors. The responses by the human recommenders are also mainly movie mentions. Explanations why a recommendation is a good match for the seeker's preferences are not very common. Developing an end-to-end system that is capable of providing explanations, which might be a helpful feature in any CRS, therefore might remain challenging.
The DeepCRS and KBRD systems did not provide explanations in the dialogs that we examined, except in cases where the system generated some sort of confirmatory utterance ("it is a very good movie"). Such utterances appeared in the training data and were correspondingly sometimes selected by the systems. The number of user intents that are actually supported by DeepCRS and KBRD is generally quite low; see [1, 9] for a list of possible intents in CRS. For the DeepCRS system, for example, the authors explicitly state that with their recommendation mechanism they are unable to respond to a seeker who asks for "a good sci-fi movie" [12]. Not being able to support intents related to explanations or feature-based requests again raises questions about the practicability of the investigated approaches.
Finally, when examining the dialogs that we sampled, we observed that a few conversations were broken. This was, for example, the case when a crowdworker had not understood the instructions. In one case, for instance, the seeker was not interested in a recommendation, but rather told the recommender that he would like to watch a certain movie. In our sample of 70 dialogs, we found 9 cases that we considered to be broken. To what extent such noise in the data impacts the performance of end-to-end learning systems, however, requires more investigation in the future.

5 CONCLUSIONS AND IMPLICATIONS
We performed an alternative and independent evaluation of two recently published end-to-end learning approaches to building conversational recommender systems. A manual inspection of the responses of the two systems reveals that these systems in many cases fail to react in a meaningful way to user utterances. The quality of the recommendations in these dialogs also appears to be limited.
Our findings have important implications. First, current evaluation practices, at least those from the analyzed papers, seem not to be informative enough to judge the practical usefulness of such systems. Relying on relative subjective comparisons (as in [12]) cannot inform us about whether or not the better-ranked system is actually good. Absolute evaluations (as in [4]) indicated mediocre outcomes, but the aggregation of the human ratings into one single metric value prevents us from understanding how well the system works for specific parts of the conversation (e.g., chit-chat, recommendation).
Measuring linguistic aspects like perplexity, or the BLEU score as done in some other works [10, 13], might in principle be helpful, even though there are some concerns regarding the correspondence of BLEU scores with human perceptions [14]. However, our analysis revealed that the examined systems almost exclusively generate sentences that were already present in the training data in identical or almost identical form. These objective measures, when applied on the sentence level, would therefore mainly judge, e.g., the perplexity of sentences written by human recommenders. A retrieval-based approach might therefore achieve the same performance or even better. This is an interesting future direction that we intend to explore.
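As an illustration of the direction we have in mind, the following is a minimal sketch of such a retrieval baseline, not an implementation used in this work: it simply returns the training-data recommender utterance whose preceding dialog context is most similar, under TF-IDF cosine similarity, to the current dialog history.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

class RetrievalBaseline:
    """Nearest-neighbor retrieval over (dialog context, recommender response) pairs."""

    def __init__(self, contexts, responses):
        # contexts[i] is the concatenated dialog history that preceded
        # responses[i] in the training data.
        self.responses = responses
        self.vectorizer = TfidfVectorizer()
        self.context_matrix = self.vectorizer.fit_transform(contexts)

    def respond(self, history):
        # Score the current history against all stored contexts and return
        # the human response attached to the most similar one.
        query = self.vectorizer.transform([" ".join(history)])
        scores = cosine_similarity(query, self.context_matrix)[0]
        return self.responses[scores.argmax()]

# Usage sketch with toy data; a real experiment would extract the pairs from ReDial.
baseline = RetrievalBaseline(
    contexts=["hi i like scary movies", "hello i want something funny for kids"],
    responses=["have you seen the conjuring ?", "you could try despicable me ."],
)
print(baseline.respond(["hi there", "i am looking for a scary movie"]))
```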
As a result, our work calls for extended, alternative, and more realistic evaluation practices for CRS. In particular, in practical applications, certain guarantees regarding the quality of the system responses and recommendations might be required, which might be difficult to achieve with current end-to-end learning approaches.

REFERENCES
[1] Wanling Cai and Li Chen. 2020. Predicting User Intents and Satisfaction with Dialogue-Based Conversational Recommendations. In UMAP '20. 33–42.
[2] Patrick Y.K. Chau, Shuk Ying Ho, Kevin K.W. Ho, and Yihong Yao. 2013. Examining the effects of malfunctioning personalized services on online users' distrust and behaviors. Decision Support Systems 56 (2013), 180–191.
[3] Li Chen and Pearl Pu. 2012. Critiquing-based recommenders: survey and emerging trends. User Modeling and User-Adapted Interaction 22, 1-2 (2012), 125–150.
[4] Qibin Chen, Junyang Lin, Yichang Zhang, Ming Ding, Yukuo Cen, Hongxia Yang, and Jie Tang. 2019. Towards Knowledge-Based Recommender Dialog System. In EMNLP-IJCNLP '19. 1803–1813.
[5] Konstantina Christakopoulou, Filip Radlinski, and Katja Hofmann. 2016. Towards Conversational Recommender Systems. In KDD '16. 815–824.
[6] Jacob Cohen. 1960. A coefficient of agreement for nominal scales. Educational and Psychological Measurement 20, 1 (1960), 37–46.
[7] Kristian J. Hammond, Robin Burke, and Kathryn Schmitt. 1994. Case-Based Approach to Knowledge Navigation. In AAAI '94.
[8] Dietmar Jannach. 2004. ADVISOR SUITE – A Knowledge-based Sales Advisory System. In ECAI '04. 720–724.
[9] Dietmar Jannach, Ahtsham Manzoor, Wanling Cai, and Li Chen. 2020. A Survey on Conversational Recommender Systems. ArXiv abs/2004.00646 (2020).
[10] Dongyeop Kang, Anusha Balakrishnan, Pararth Shah, Paul Crook, Y-Lan Boureau, and Jason Weston. 2019. Recommendation as a Communication Game: Self-Supervised Bot-Play for Goal-oriented Dialogue. In EMNLP-IJCNLP '19. 1951–1961.
[11] Dongyeop Kang, Anusha Balakrishnan, Pararth Shah, Paul A. Crook, Y-Lan Boureau, and Jason Weston. 2019. Recommendation as a Communication Game: Self-Supervised Bot-Play for Goal-oriented Dialogue. ArXiv abs/1909.03922 (2019).
[12] Raymond Li, Samira Ebrahimi Kahou, Hannes Schulz, Vincent Michalski, Laurent Charlin, and Chris Pal. 2018. Towards Deep Conversational Recommendations. In NIPS '18. 9725–9735.
[13] Lizi Liao, Ryuichi Takanobu, Yunshan Ma, Xun Yang, Minlie Huang, and Tat-Seng Chua. 2019. Deep Conversational Recommender in Travel. ArXiv abs/1907.00710 (2019).
[14] Chia-Wei Liu, Ryan Lowe, Iulian Serban, Mike Noseworthy, Laurent Charlin, and Joelle Pineau. 2016. How NOT To Evaluate Your Dialogue System: An Empirical Study of Unsupervised Evaluation Metrics for Dialogue Response Generation. In EMNLP '16. 2122–2132.
[15] Tariq Mahmood and Francesco Ricci. 2009. Improving Recommender Systems with Adaptive Conversational Strategies. In HT '09. 73–82.
[16] Iulian V. Serban, Alessandro Sordoni, Yoshua Bengio, Aaron Courville, and Joelle Pineau. 2016. Building End-to-End Dialogue Systems Using Generative Hierarchical Neural Network Models. In AAAI '16. 3776–3783.