An Automatic Procedure for Generating Datasets for Conversational Recommender Systems

People have information needs of varying complexity, which can be satisfied by an intelligent agent able to answer suitably formulated questions, possibly taking user context and preferences into account. Conversational Recommender Systems (CRS) assist online users in their information-seeking and decision-making tasks by supporting an interactive process [1] aimed at finding the items that best match the user's preferences.

Unfortunately, collecting the dialogue data required to train these systems can be highly labour-intensive, especially for the latest data-hungry Deep Learning models. For this reason, synthetic dialogue datasets can be extremely useful for bootstrapping effective dialogue systems able to support a goal-oriented conversation with the user. We therefore propose an automatic procedure that generates plausible dialogues directly from well-known recommender systems datasets, exploiting data from the Linked Open Data Cloud and contextual information related to the user.

Given a user u and their set of binary preferences, we train a decision tree on the preferences that u expressed towards items, where each item is represented by Linked Open Data binary features extracted from the Wikidata knowledge base. In particular, each predicate-object pair is represented as a binary feature which is 1 if the knowledge base contains the triple (item, predicate, object), 0 otherwise. The considered predicates are wdt:P57 (director), wdt:P161 (cast member) and wdt:P136 (genre). The dialogue generation procedure is an iterative algorithm which is executed until all user preferences have been used. At each step of the dialogue generation procedure, a top-n list of items composed of positive and negative items is generated by randomly choosing from the positive and negative preferences of the given user u. Then, paths from the root of the decision tree to the consistently classified examples are exploited to generate a sequence of questions, randomly chosen according to a binomial distribution over the item features, to elicit user preferences. Depending on the percentage of positive items in the top-n, a "refine" step is triggered which extends the dialogue with additional questions that lead to a list of suggestions containing only positive items.

The dialogue datasets generated from the MovieLens 1M and MovieTweetings datasets can be found at: http://github.com/swapUniba/ConvRecSysDataset. The source code of the automatic procedure for generating conversational recommender systems datasets will be released once the paper is accepted.
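The binary item representation and the iterative top-n sampling described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the toy triples, the function names, the list size `n` and the `refine_below` threshold are all hypothetical, since the text does not state how the percentage of positive items triggers the "refine" step. The decision tree training and the question generation are deliberately left out.

```python
import random

# Hypothetical toy triples (item, predicate, object); real ones would come from Wikidata.
TRIPLES = {
    ("Alien", "wdt:P57", "Ridley Scott"),
    ("Alien", "wdt:P136", "science fiction"),
    ("Blade Runner", "wdt:P57", "Ridley Scott"),
    ("Blade Runner", "wdt:P161", "Harrison Ford"),
    ("Heat", "wdt:P161", "Al Pacino"),
}
# Predicates named in the text: director, cast member, genre.
PREDICATES = ("wdt:P57", "wdt:P161", "wdt:P136")


def feature_space(triples):
    """Sorted list of (predicate, object) pairs, one binary feature per pair."""
    return sorted({(p, o) for _, p, o in triples if p in PREDICATES})


def encode(item, triples, features):
    """1 if the knowledge base contains (item, predicate, object), else 0."""
    return [1 if (item, p, o) in triples else 0 for p, o in features]


def sample_top_n(positives, negatives, n, rng):
    """Draw up to n items at random from the unused positive and negative preferences."""
    pool = [(item, True) for item in positives] + [(item, False) for item in negatives]
    return rng.sample(pool, min(n, len(pool)))


def generation_steps(positives, negatives, n=3, refine_below=1.0, seed=0):
    """Iterate until every preference has been used exactly once.

    Assumption: the "refine" step is flagged when the share of positive
    items in the top-n falls below `refine_below` (a hypothetical threshold).
    """
    rng = random.Random(seed)
    positives, negatives = set(positives), set(negatives)
    steps = []
    while positives or negatives:
        top_n = sample_top_n(sorted(positives), sorted(negatives), n, rng)
        share = sum(liked for _, liked in top_n) / len(top_n)
        steps.append({"top_n": top_n, "refine": share < refine_below})
        # Mark the sampled preferences as used.
        for item, liked in top_n:
            (positives if liked else negatives).discard(item)
    return steps
```

Calling `encode("Alien", TRIPLES, feature_space(TRIPLES))` yields one 0/1 value per predicate-object pair, which is the kind of input a decision tree learner would receive. `generation_steps(["Alien", "Blade Runner"], ["Heat"], n=2)` returns one record per iteration, each with its sampled list and whether a refine step would follow under the assumed threshold.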

Table 1 shows a conversation generated by applying the designed procedure to the well-known MovieLens 1M recommender systems dataset. In the first part of the conversation, utterances that introduce the user are generated by exploiting the contextual information included in the dataset.

In this work we have proposed an automatic procedure able to generate synthetic dialogue datasets starting from well-known datasets in the recommender systems field. The presented procedure is completely generic and can be applied to any dataset that contains binary user preferences and whose items have a corresponding identifier in the Linked Open Data Cloud.