A Dataset of Real Dialogues for Conversational Recommender Systems

Andrea Iovine, Fedelucio Narducci, Marco de Gemmis
Department of Computer Science, University of Bari Aldo Moro, Italy
firstname.lastname@uniba.it

Abstract

Conversational Recommender Systems (CoRS) that use natural language to interact with users usually need to be trained on large quantities of text data. Since the utterances used during the interaction with a CoRS may differ depending on the domain of the items, the system should also be trained separately for each domain. So far, there are no publicly available datasets based on real dialogues for training the components of a CoRS. In this paper, we propose three datasets that are useful for training a CoRS in the movie, book, and music domains. These datasets have been collected during a user study for evaluating a CoRS. They can be used to train several components, such as the Intent Recognizer, Entity Recognizer, and Sentiment Recognizer.

Copyright 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 Introduction

Recommender Systems (RS) are software systems that help people make better decisions (Jameson et al., 2015). They have become a fundamental tool for overcoming the information overload problem, which is caused by the ever-increasing variety of information and products that people can access (Ricci et al., 2011). Choosing between such a large quantity of options is not easy, and this results in a decrease in the quality of the decisions. Recommender systems help alleviate the problem by providing personalized suggestions to users, based on their preferences.

Conversational Recommender Systems (CoRS) are a particular type of Recommender System that acquires the user's profile in an interactive manner (Mahmood and Ricci, 2009). This means that, in order to receive a recommendation, the system does not require that all the information is provided beforehand; instead, it guides the user through an interactive, human-like dialog (Jugovac and Jannach, 2017). Even though a CoRS can be implemented using several different interfaces, it is reasonable to think that an interaction based on natural language is suitable for the task. In particular, Digital Assistants (DA) such as Amazon Alexa, Google Assistant, or Apple's Siri are interesting platforms for delivering recommendations in a conversational manner. DAs, popularized with the diffusion of smartphones, are able to help users complete everyday tasks through a conversation in natural language. However, there is still a technological gap between CoRSs and DAs, as described in (Rafailidis and Manolopoulos, 2018). In particular, one of the main causes of that gap is the lack of labeled data. In fact, implementing a natural language-based interface for a CoRS is not easy, as it requires several Natural Language Understanding (NLU) operations. For example, a basic conversational recommender needs at least three NLU components: an Intent Recognizer, an Entity Recognizer, and a Sentiment Analyzer. These components need to be trained on large quantities of real sentences, which may not always be available. The problem is worsened by the fact that each component may need to be trained separately for each domain.

In this paper, we present three datasets that contain utterances used in real dialogues between users and a CoRS in the movie, book, and music domains, respectively. These datasets can then be used to train the components of a new CoRS. To the best of our knowledge, this is the first time such a dataset of real dialogues is provided for the book and music domains, while there is already one example for the movie domain (Li et al., 2018). The dataset is available at https://github.com/aiovine/converse-dataset.

Section 2 contains a literature review of datasets for training Question Answering and Conversational Recommender Systems. Section 3 illustrates the architecture of the CoRS that was used to collect the messages in the dataset. Section 4 describes the three datasets in detail, providing some statistics and a small example of a conversation.

2 Related Work

The problem of finding dialogues between humans and machines is not new, and in the literature there are already some examples of conversational datasets that can be used to train a new conversational agent. Serban et al. (2015) published a literature survey of natural language datasets for CoRSs and Question Answering systems.

Dodge et al. (2015) presented a dataset for evaluating the performance of End-to-End Conversational Agents (CA), with a focus on the movie domain. End-to-End CAs use a single (usually deep learning-based) model to learn a response directly, given a user utterance. The objective of the dataset is to test Question Answering and Recommendation abilities. The dataset is generated synthetically using data from MovieLens and the Open Movie Database, and consists of 3.5 million training examples covering 75,000 movie entities. This work differs from our contribution for several reasons. The most important difference is that our dataset is not used to learn what items to recommend, but rather how to understand the user utterances. Thus, it is independent of the recommendation algorithm used. Furthermore, our dataset includes the book and music domains, and only uses real dialogues.

Braun et al. (2017) also developed two datasets for the evaluation of QA systems. The first dataset contains questions about public transport, and was collected through a Telegram chatbot. It consists of 206 manually annotated questions. The second dataset contains data collected from two StackExchange platforms, and consists of 290 questions and answers. The datasets were created to compare several NLP platforms in terms of their ability to recognize intents and entities for a QA system.

Asri et al. (2017) presented the Frames dataset, a corpus of 1369 dialogs generated through a Wizard-of-Oz setting. It was created to train a goal-oriented information-retrieval Conversational Agent that is able to find items in a database given a set of constraints. The main objective of the authors was to add memory capabilities to the CA. Each message is annotated using frames.

Suglia et al. (2017) propose an automatic procedure for generating plausible synthetic dialogues for movie-based CoRSs. This procedure takes as input a movie recommendation dataset (such as MovieLens), and turns each set of user preferences into a full conversation. The datasets created with this procedure can be used for training an End-to-End Conversational Recommender System. The purpose is thus very similar to that of our contribution. However, we provide user-generated messages, rather than synthetic ones.

Kang et al. (2017) investigated how people interact with a natural language-based CoRS through voice or text. To do this, the authors developed a natural language interface and integrated it in the MovieLens system. Then, they recorded the messages written (or spoken) by the users, i.e. what kinds of queries they use. From the collected data, the authors classified three types of recommendation goals, and several types of follow-up queries. Data from 347 users was collected and subsequently released. While interesting, this dataset does not specifically aim to train a new CoRS.

Li et al. (2018) developed ReDial, a dataset consisting of over 10,000 conversations, with the objective of providing movie recommendations. This dataset was conceived to train deep learning-based components, namely a sentiment analyzer and a recommendation algorithm. According to the authors, it is the only real-world, two-party conversational corpus for CoRSs. The dataset was used to train a movie-based CoRS that uses components based on deep learning, such as an RNN for sentiment analysis, and an autoencoder for the recommendation. This dataset is probably the most similar to the one presented in this paper. However, it differs from it for two reasons: first, we provide datasets for three domains, rather than just the movie domain. Second, as stated earlier, our dataset is independent of the recommendation algorithm, and its only objective is to understand how to maintain the conversation and acquire the user's preferences.

3 A Multi-Domain Conversational Recommender System

The dataset presented in this work is the result of the development and testing of a multi-domain Conversational Recommender System. The system is able to communicate with users via messages in natural language, both when acquiring their preferences and when providing suggestions. The recommendation process can be divided into two parts: a preference acquisition phase and a recommendation phase. In the first phase, the user is able to talk to the system freely. Preferences are expressed in the form of liked or disliked items. For example, a user can use a sentence like "I love Stephen King, but I don't like The Shining". Multiple ratings can be given in the same sentence, and they can also be given to different types of items (in this case, an author and his book). In case of ambiguity, the system may ask the user to clarify (disambiguate).

Once enough preferences are provided, the recommendation phase may start. This is done by asking for recommendations (e.g. "What book can I read today?"). During the recommendation phase, the system suggests a set of items, each of which can be rated positively or negatively by the user. A critiquing function also allows the user to criticize some aspects of the suggested item (e.g. "I like this movie, but I don't like Mel Gibson"). It is also possible to ask for more details about the recommended item, for a trailer/preview, or for an explanation (e.g. "Why did you suggest this song?").

Our CoRS uses a modular architecture, made up of several components, each with a specific responsibility. It was deployed as a Telegram chatbot, but it can easily be ported to any other messaging platform, such as Facebook Messenger. The components in question (as seen in Figure 1) are:

• Dialog Manager: This component is responsible for maintaining a conversation with the user in a persistent way. It decides what action should be performed given the user intent, invokes the other components, aggregates their outputs, and produces the final response.

• Intent Recognizer: This component is responsible for understanding the action that the user is requesting. For example, when the user says "I like Michael Jackson", the preference intent is recognized. The Intent Recognizer is powered by DialogFlow (https://dialogflow.com/).

• Entity Recognizer: This component is responsible for recognizing entities mentioned by the user. Given the previous example, it is able to recognize Michael Jackson as an entity mention. It exploits Wikidata (https://www.wikidata.org), and does not require any training. This component was developed in-house.

• Sentiment Analyzer: This component is responsible for recognizing the user's sentiment on the recognized entities. Given the previous example, it recognizes a positive rating for Michael Jackson. This component is developed using Stanford CoreNLP (https://stanfordnlp.github.io/CoreNLP/).

• Recommendation Services: This component is responsible for the recommendation algorithm. In particular, we use a Content-Based recommender based on PageRank with priors.

Figure 1: Architecture of the CoRS

4 ConveRSE Datasets

In this section, we describe the main features of the dataset and the process that we used to build it. The dialogues were recorded during an experimental session, in which participants were asked to interact with three CoRSs, each for a specific domain (movies, books, and music). During the preference acquisition phase, each participant wrote some positive/negative ratings. After that, participants were asked to request a recommendation, and then evaluated five recommended items. Finally, users asked the system to view their profiles. From this experiment, we collected 5,318 messages for the movie domain, 1,862 for the book domain, and 2,096 for the music domain.

For each message, we collected the user's utterance, the intent recognized by the system, unique IDs for the user and the message, a timestamp, a list of contexts, a list of recognized items, and a set of actions. We chose not to include the system's responses in the dataset, since they are generated via a template. Instead, we report a set of actions that together map the reaction of the system to the user message, and the current status of the conversation. For example, the recommendation action means that the user is in the recommendation phase. The question action means that the system responded to the user by asking a question (i.e. requesting a disambiguation, or asking the user to rate a recommended item). Finally, the finished recommendation action signals that the message concludes a recommendation phase. An item is included in the list of recognized items only after it was correctly disambiguated (if a disambiguation was needed). For example, if the user writes "I like Tom Cruise", the system responds "You said that you like Tom Cruise, can you be more specific? Possible values are: producer, actor". Only when the user responds to this question will the item be recorded as recognized in the dataset. For each recognized item, we record its Wikidata ID and a symbol that identifies the rating ('+' for positive, '-' for negative).

We applied some heuristics to improve the quality of the data. In particular, the objective is to understand whether the recognized intents and entities are correct. To do this, each conversation was split into tasks, where a task is defined as a sequence of messages with a specific goal. For each task, we observed whether it terminated successfully, or an anomaly occurred. Some examples of tasks that are completed correctly are:

• A preference message, followed by one or more disambiguations;
• A recommendation request, followed by one or more preferences to the recommended item, requests for details and explanations;
• A request for showing the profile.

Some examples of tasks that are not completed correctly are:

• Any task containing a fallback intent (meaning that the intent was not recognized);
• Tasks in which the user asks to skip a disambiguation request, or to stop the recommendation phase;
• Tasks in which an unexpected intent is found (e.g. a preference for an unrelated item during the recommendation phase).

For each message, we added a field called toCheck. This field is set to false if the message is part of a completed task, and true otherwise. In the latter case, it is advised to manually check the correctness of the intent.

Table 1 describes some statistics extracted from the dataset. More precisely, we report the number of users and messages, the number of preference messages and recommendation requests, the average number of messages per user, the percentage of liked and disliked items (both in the preference acquisition and recommendation phases), the percentage of critiquing, details, preview, and explanation requests (over all recommended items), and the percentage of messages for which toCheck is equal to true. For privacy reasons, we anonymized the dialogues by replacing the original Telegram user ID with a numerical index.

                          Movie    Book   Music
#Users                      149      56      56
#Messages                  5318    1862    2096
#Messages per user         35.7    33.3    37.4
#Preference messages       2172     734    1011
#Recomm. requests           456     369     144
%Liked (Preference)        89.8    91.6    93.5
%Disliked (Preference)     10.2    8.40    6.54
%Liked (Recomm.)           77.6    77.7    73.2
%Disliked (Recomm.)        22.4    22.3    26.8
%Critiquing                 1.6     0.0    0.42
%Details requests          11.4     3.6    2.08
%Preview requests          6.98     1.7   0.625
%Explanation requests      10.5    1.49     2.5
%To check                  39.6    28.8    26.0

Table 1: ConveRSE dataset statistics

4.1 Example of conversation

In this section, we describe a small example of a conversation between a user and the movie-based instance of the CoRS. For each message in Table 2, we describe the utterance along with its main features, in order to make the underlying dialog model more understandable. The following paragraphs contain a short explanation for each message. For brevity, the example contains messages from different conversations, in order to show more intents with fewer messages.

#  Message                               Intent                               Recognized objects  Status
1  I like the avengers                   preference                                               question, disambiguation
2  The Avengers (2012)                   preference - disambiguation          Q182218+
3  Suggest some film                     request recommendation                                   recommendation, question
4  I like this movie                     request recommendation - preference  Q14171368+          recommendation, question
5  Why do you suggest this movie?        request recommendation - why                             recommendation, question
6  I love it, but I don't like director  request recommendation - yes but     Q220192+            recommendation, question
7  Can you show my preferences           show profile

Table 2: Short example of conversation in the movie dataset

1. The user has provided a preference during the preference acquisition phase. The recognized intent is therefore preference. Since there are multiple movies matching "The Avengers", further disambiguation is required. This is indicated via the question and disambiguation actions.

2. The user has answered the disambiguation request, by specifying that he/she means the movie "The Avengers (2012)". This is associated with the preference - disambiguation intent. Note that only now is the movie included in the recognized objects field.

3. When the user sends this message, a new recommendation phase is started. The corresponding intent is request recommendation. When this happens, the system proposes a movie that will be rated by the user. The actions question and recommendation are used to indicate that the CoRS is expecting a rating from the user.

4. When the user provides a rating for a recommended entity (in this case, "I like this movie"), the request recommendation - preference intent is used. The rating of the recommended item is also registered in the recognized objects field. The recommendation and question actions in this case signify that the system responds by presenting another recommended movie to rate.

5. In this case, the user asks for an explanation of the recommended item. The request recommendation - why intent is used. After the explanation is given, the system asks again to rate the movie, as evidenced by the recorded actions.

6. Here, the user provides a rating, but also criticizes the recommendation, by adding a negative rating for the director of the recommended movie (previously referred to as critiquing). The request recommendation - yes but intent is used in this case. Our CoRS requests an additional confirmation when associating a property (i.e. director) with a recommended item; however, it can be ignored when training a new CoRS.

7. In this case, the user requests to see his/her profile, as indicated by the show profile intent. This can be optionally followed by requests for editing or deleting the profile.

5 Conclusions

In this paper, we presented three datasets that contain real user messages sent to Conversational Recommender Systems in the movie, book, and music domains. The datasets can be used to train a new CoRS to detect intents and, with a few modifications, also to recognize entities and sentiments. The size of the data that we provide may not be sufficient to train deep learning-based End-to-End conversational recommendation models. However, this is outside the scope of our work: as stated in the previous sections, the aim of our datasets is to learn a conversational recommendation dialog model, independently of the actual recommendation algorithm. In any case, we believe that this is the first time that a dataset for training CoRSs in the book and music domains has been released. We also believe that this is a good starting point for the release of further conversational datasets in multiple domains.

As future work, we propose to expand the datasets by collecting more messages, in more domains. We will also explore the possibility of using our datasets to evaluate new CoRSs.

References

Layla El Asri, Hannes Schulz, Shikhar Sharma, Jeremie Zumer, Justin Harris, Emery Fine, Rahul Mehrotra, and Kaheer Suleman. 2017. Frames: A Corpus for Adding Memory to Goal-Oriented Dialogue Systems. arXiv:1704.00057 [cs], March.

Daniel Braun, Adrian Hernandez-Mendez, Florian Matthes, and Manfred Langen. 2017. Evaluating Natural Language Understanding Services for Conversational Question Answering Systems. In Proceedings of the 18th Annual SIGdial Meeting on Discourse and Dialogue, pages 174–185, Saarbrücken, Germany. Association for Computational Linguistics.

Jesse Dodge, Andreea Gane, Xiang Zhang, Antoine Bordes, Sumit Chopra, Alexander Miller, Arthur Szlam, and Jason Weston. 2015. Evaluating Prerequisite Qualities for Learning End-to-End Dialog Systems. arXiv:1511.06931 [cs], November.

Anthony Jameson, Martijn C. Willemsen, Alexander Felfernig, Marco de Gemmis, Pasquale Lops, Giovanni Semeraro, and Li Chen. 2015. Human Decision Making and Recommender Systems, pages 611-648. Springer US.

Michael Jugovac and Dietmar Jannach. 2017. Interacting with Recommenders: Overview and Research Directions. ACM Transactions on Interactive Intelligent Systems, 7(3):1–46, September.

Jie Kang, Kyle Condiff, Shuo Chang, Joseph A. Konstan, Loren Terveen, and F. Maxwell Harper. 2017. Understanding How People Use Natural Language to Ask for Recommendations. In Proceedings of the Eleventh ACM Conference on Recommender Systems - RecSys '17, pages 229–237, Como, Italy. ACM Press.

Raymond Li, Samira Kahou, Hannes Schulz, Vincent Michalski, Laurent Charlin, and Chris Pal. 2018. Towards Deep Conversational Recommendations.

Tariq Mahmood and Francesco Ricci. 2009. Improving recommender systems with adaptive conversational strategies. In Proceedings of the 20th ACM conference on Hypertext and hypermedia - HT '09, page 73, Torino, Italy. ACM Press.

Dimitrios Rafailidis and Yannis Manolopoulos. 2018. The Technological Gap Between Virtual Assistants and Recommendation Systems. arXiv:1901.00431 [cs], December.

Francesco Ricci, Lior Rokach, and Bracha Shapira. 2011. Introduction to recommender systems handbook. Springer US.

Iulian Vlad Serban, Ryan Lowe, Peter Henderson, Laurent Charlin, and Joelle Pineau. 2015. A Survey of Available Corpora for Building Data-Driven Dialogue Systems. arXiv:1512.05742 [cs, stat], December.

Alessandro Suglia, Claudio Greco, Pierpaolo Basile, Giovanni Semeraro, and Annalina Caputo. 2017. An Automatic Procedure for Generating Datasets for Conversational Recommender Systems.
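Taken together, the three NLU components form a small pipeline: the intent is recognized first, then entity mentions are found, and finally a sentiment is attached to each mention. The sketch below is a toy, rule-based stand-in for that division of labor; the real system relies on DialogFlow, Wikidata, and Stanford CoreNLP, and the keyword rules and the entity list here are invented purely for illustration.

```python
# Toy sketch of the three NLU components described in Section 3.
# The real CoRS delegates these steps to DialogFlow, Wikidata, and
# Stanford CoreNLP; the rules and entity list below are illustrative only.

KNOWN_ENTITIES = {"michael jackson", "the shining", "stephen king"}  # stand-in for Wikidata

def recognize_intent(utterance: str) -> str:
    """Map an utterance to one of the intents used in the dataset."""
    text = utterance.lower()
    if "suggest" in text or "recommend" in text:
        return "request recommendation"
    if "show my preferences" in text or "my profile" in text:
        return "show profile"
    if "like" in text or "love" in text or "hate" in text:
        return "preference"
    return "fallback"

def recognize_entities(utterance: str) -> list:
    """Return known entity mentions found in the utterance."""
    text = utterance.lower()
    return [e for e in KNOWN_ENTITIES if e in text]

def analyze_sentiment(utterance: str, entity: str) -> str:
    """Assign '+' or '-' to an entity mention (the dataset's rating symbols)."""
    clauses = [c for c in utterance.lower().split("but") if entity in c]
    negative = any(w in clauses[0] for w in ("don't like", "hate", "dislike"))
    return "-" if negative else "+"

def understand(utterance: str) -> dict:
    """Run the full pipeline on one user message."""
    return {
        "intent": recognize_intent(utterance),
        "ratings": {e: analyze_sentiment(utterance, e)
                    for e in recognize_entities(utterance)},
    }

print(understand("I love Stephen King, but I don't like The Shining"))
```

Note how the clause split on "but" lets a single sentence carry both a positive and a negative rating, which is exactly the situation the paper's "I love Stephen King, but I don't like The Shining" example describes.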
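The "PageRank with priors" idea behind the Recommendation Services component can be sketched with a short power-iteration implementation (personalized PageRank, where the restart distribution is concentrated on the user's liked items). The item graph, property nodes, and liked item below are invented for illustration; the real system runs over a much richer content graph.

```python
# Minimal power-iteration sketch of PageRank with priors (personalized
# PageRank), the idea behind the Recommendation Services component.
# The graph and the liked item are invented for illustration.

def pagerank_with_priors(graph, priors, damping=0.85, iters=50):
    """graph: node -> list of neighbours; priors: node -> restart probability."""
    nodes = list(graph)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        # Restart mass goes to the prior nodes instead of being uniform.
        new = {n: (1.0 - damping) * priors.get(n, 0.0) for n in nodes}
        for n in nodes:
            out = graph[n]
            if not out:
                continue
            share = damping * rank[n] / len(out)
            for m in out:
                new[m] += share
        rank = new
    return rank

# Tiny item/property graph: movies linked to the people they share.
graph = {
    "The Avengers (2012)": ["Joss Whedon"],
    "Serenity": ["Joss Whedon"],
    "Braveheart": ["Mel Gibson"],
    "Joss Whedon": ["The Avengers (2012)", "Serenity"],
    "Mel Gibson": ["Braveheart"],
}
# The prior concentrates the restart on the user's liked item.
priors = {"The Avengers (2012)": 1.0}
scores = pagerank_with_priors(graph, priors)
print(scores["Serenity"] > scores["Braveheart"])
```

Concentrating the restart distribution on liked items is what personalizes the ranking: probability mass flows from the liked movie through shared properties (here, a common director) to related items, so "Serenity" outscores the unconnected "Braveheart".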
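As an illustration of the per-message record format described in Section 4 (utterance, recognized intent, user and message IDs, timestamp, contexts, recognized items with '+'/'-' rating symbols, actions, and the toCheck flag), the sketch below builds a few hypothetical records and applies the filtering the paper recommends. The field names follow the paper's description, but the exact JSON layout is an assumption, not the official schema of the released files.

```python
import json

# Hypothetical records following the field list in Section 4; the exact
# JSON layout is an assumption, not the official schema of the dataset.
records = json.loads("""[
  {"userID": 1, "messageID": 10, "timestamp": "2019-03-01T10:00:00",
   "utterance": "I like the avengers", "intent": "preference",
   "contexts": [], "recognizedItems": [],
   "actions": ["question", "disambiguation"], "toCheck": false},
  {"userID": 1, "messageID": 11, "timestamp": "2019-03-01T10:00:20",
   "utterance": "The Avengers (2012)", "intent": "preference - disambiguation",
   "contexts": [], "recognizedItems": ["Q182218+"],
   "actions": [], "toCheck": false},
  {"userID": 1, "messageID": 12, "timestamp": "2019-03-01T10:01:00",
   "utterance": "asdfgh", "intent": "fallback",
   "contexts": [], "recognizedItems": [],
   "actions": [], "toCheck": true}
]""")

# Keep only messages from successfully completed tasks (toCheck == false),
# and split each recognized item into its Wikidata ID and rating symbol.
clean = [r for r in records if not r["toCheck"]]
ratings = [(item[:-1], item[-1])            # e.g. ("Q182218", "+")
           for r in clean for item in r["recognizedItems"]]

print(len(clean), ratings)
```

Filtering on toCheck before training mirrors the paper's heuristic: messages from tasks that ended in a fallback intent or another anomaly should be inspected manually rather than fed to an intent classifier as-is.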