=Paper= {{Paper |id=Vol-2481/paper37 |storemode=property |title=A Dataset of Real Dialogues for Conversational Recommender Systems |pdfUrl=https://ceur-ws.org/Vol-2481/paper37.pdf |volume=Vol-2481 |authors=Andrea Iovine,Fedelucio Narducci,Marco de Gemmis |dblpUrl=https://dblp.org/rec/conf/clic-it/IovineNG19 }} ==A Dataset of Real Dialogues for Conversational Recommender Systems== https://ceur-ws.org/Vol-2481/paper37.pdf
A Dataset of Real Dialogues for Conversational Recommender Systems

Andrea Iovine          Fedelucio Narducci          Marco de Gemmis
Department of Computer Science, University of Bari Aldo Moro, Italy
firstname.lastname@uniba.it


Abstract

Conversational Recommender Systems (CoRS) that use natural language to interact with users usually need to be trained on large quantities of text data. Since the utterances used during the interaction with a CoRS may differ depending on the domain of the items, the system should also be trained separately for each domain. So far, there are no publicly available datasets based on real dialogues for training the components of a CoRS. In this paper, we propose three datasets that are useful for training a CoRS in the movie, book, and music domains. These datasets were collected during a user study for evaluating a CoRS. They can be used to train several components, such as the Intent Recognizer, Entity Recognizer, and Sentiment Recognizer.

1 Introduction

Recommender Systems (RS) are software systems that help people make better decisions (Jameson et al., 2015). They have become a fundamental tool for overcoming the information overload problem, which is caused by the ever-increasing variety of information and products that people can access (Ricci et al., 2011). Choosing among such a large number of options is not easy, and this results in a decrease in the quality of the decisions. Recommender systems help alleviate the problem by providing personalized suggestions to users, based on their preferences.

Conversational Recommender Systems (CoRS) are a particular type of Recommender System that acquires the user's profile in an interactive manner (Mahmood and Ricci, 2009). This means that, in order to produce a recommendation, the system does not require all the information to be provided beforehand; instead, it guides the user through an interactive, human-like dialog (Jugovac and Jannach, 2017). Even though a CoRS can be implemented using several different interfaces, it is reasonable to think that an interaction based on natural language is suitable for the task. In particular, Digital Assistants (DA) such as Amazon Alexa, Google Assistant, or Apple's Siri are interesting platforms for delivering recommendations in a conversational manner. DAs, popularized with the diffusion of smartphones, are able to help users complete everyday tasks through a conversation in natural language. However, there is still a technological gap between CoRSs and DAs, as described in (Rafailidis and Manolopoulos, 2018). In particular, one of the main causes of that gap is the lack of labeled data. In fact, implementing a natural language-based interface for a CoRS is not easy, as it requires several Natural Language Understanding (NLU) operations. For example, a basic conversational recommender needs at least three NLU components: an Intent Recognizer, an Entity Recognizer, and a Sentiment Analyzer. These components need to be trained on large quantities of real sentences, which may not always be available. The problem is worsened by the fact that each component may need to be trained separately for each domain.

In this paper, we present three datasets that contain utterances used in real dialogues between users and a CoRS, in the movie, book, and music domains respectively. These datasets can then be used to train the components of a new CoRS. To the best of our knowledge, this is the first time such a dataset of real dialogues is provided for the book and music domains, while one example already exists for the movie domain (Li et al., 2018). The dataset is available at the following link1.

Section 2 contains a literature review of datasets for training Question Answering and Conversational Recommender Systems. Section 3 illustrates the architecture of the CoRS that was used to collect the messages in the dataset. Section 4 describes the three datasets in detail, providing some statistics and a small example of a conversation.

Copyright 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

2 Related Work

The problem of finding dialogues between humans and machines is not new, and there are already some examples of conversational datasets in the literature that can be used to train a new conversational agent. Serban et al. (2015) published a literature survey of natural language datasets for CoRSs and Question Answering systems.

Dodge et al. (2015) presented a dataset for evaluating the performance of End-to-End Conversational Agents (CA), with a focus on the movie domain. End-to-End CAs use a single (usually deep learning-based) model to learn a response directly, given a user utterance. The objective of the dataset is to test Question Answering and Recommendation abilities. The dataset is generated synthetically using data from MovieLens and the Open Movie Database, and consists of 3.5 million training examples, covering 75,000 movie entities. This work differs from our contribution for several reasons. The most important difference is that our dataset is not used to learn what items to recommend, but rather how to understand the user's utterances. Thus, it is independent of the recommendation algorithm used. Furthermore, our dataset includes the book and music domains, and only uses real dialogues.

Braun et al. (2017) also developed two datasets for the evaluation of QA systems. The first dataset contains questions about public transport, and was collected through a Telegram chatbot. It consists of 206 manually annotated questions. The second dataset contains data collected from two StackExchange platforms, and consists of 290 questions and answers. The datasets were created to compare several NLP platforms in terms of their ability to recognize intents and entities for a QA system.

Asri et al. (2017) presented the Frames dataset, a corpus of 1369 dialogs generated in a Wizard-of-Oz setting. It was created to train a goal-oriented information-retrieval Conversational Agent that is able to find items in a database given a set of constraints. The main objective of the authors was to add memory capabilities to the CA. Each message is annotated using frames.

Suglia et al. (2017) propose an automatic procedure for generating plausible synthetic dialogues for movie-based CoRSs. This procedure takes as input a movie recommendation dataset (such as MovieLens) and turns each set of user preferences into a full conversation. The datasets created with this procedure can be used for training an End-to-End Conversational Recommender System. The purpose is thus very similar to that of our contribution. However, we provide user-generated messages, rather than synthetic ones.

Kang et al. (2017) investigated how people interact with a natural language-based CoRS through voice or text. To do this, the authors developed a natural language interface and integrated it into the MovieLens system. They then recorded the messages written (or spoken) by the users, i.e. what kinds of queries they use. From the collected data, the authors classified three types of recommendation goals and several types of follow-up queries. Data from 347 users was collected and subsequently released. While interesting, this dataset does not specifically aim to train a new CoRS.

Li et al. (2018) developed ReDial, a dataset consisting of over 10,000 conversations, with the objective of providing movie recommendations. This dataset was conceived to train deep learning-based components, namely a sentiment analyzer and a recommendation algorithm. According to the authors, it is the only real-world, two-party conversational corpus for CoRSs. The dataset was used to train a movie-based CoRS whose components are based on deep learning, such as an RNN for sentiment analysis and an autoencoder for the recommendation. This dataset is probably the most similar to the one presented in this paper. However, it differs from it in two ways: first, we provide datasets for three domains, rather than just the movie domain. Second, as stated earlier, our dataset is independent of the recommendation algorithm, and its only objective is to understand how to maintain the conversation and acquire the user's preferences.

1 https://github.com/aiovine/converse-dataset
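As noted in the introduction, a basic natural language front end for a CoRS chains at least intent, entity, and sentiment recognition. The following is a toy, self-contained sketch of these three steps; the keyword rules, the `CATALOG` lookup, and all function names are purely illustrative assumptions, not the components of our system (which are described in Section 3).

```python
# Toy sketch of the three NLU steps a CoRS front end needs.
# The keyword rules and the CATALOG lookup are illustrative
# stand-ins for real intent/entity/sentiment components.
import re

# Hypothetical catalog of known entities (stand-in for a Wikidata lookup).
CATALOG = {"michael jackson", "the shining", "stephen king"}

def recognize_intent(utterance: str) -> str:
    """Map an utterance to a coarse intent via keyword rules."""
    text = utterance.lower()
    if "recommend" in text or "suggest" in text:
        return "request recommendation"
    if "like" in text or "love" in text or "hate" in text:
        return "preference"
    return "fallback"

def recognize_entities(utterance: str):
    """Return catalog entities mentioned in the utterance."""
    text = utterance.lower()
    return [e for e in CATALOG if e in text]

def recognize_sentiment(utterance: str) -> str:
    """Very rough utterance-level polarity: negation words flip to negative."""
    text = utterance.lower()
    return "-" if re.search(r"\b(don't|do not|hate|dislike)\b", text) else "+"

def understand(utterance: str):
    """Run the three toy NLU steps on one utterance."""
    return {
        "intent": recognize_intent(utterance),
        "entities": recognize_entities(utterance),
        "sentiment": recognize_sentiment(utterance),
    }

print(understand("I like Michael Jackson"))
# → {'intent': 'preference', 'entities': ['michael jackson'], 'sentiment': '+'}
```

Note that this utterance-level sketch cannot handle mixed-polarity sentences such as "I love Stephen King, but I don't like The Shining", which is precisely the kind of case that motivates training on real dialogues.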
3 A Multi-Domain Conversational Recommender System

The dataset presented in this work is the result of the development and testing of a multi-domain Conversational Recommender System. The system is able to communicate with users via messages in natural language, both when acquiring their preferences and when providing suggestions. The recommendation process can be divided into two parts: a preference acquisition phase and a recommendation phase. In the first phase, the user can talk to the system freely. Preferences are expressed in the form of liked or disliked items. For example, a user can write a sentence like "I love Stephen King, but I don't like The Shining". Multiple ratings can be given in the same sentence, and they can also be given to different types of items (in this case, an author and his book). In case of ambiguity, the system may ask the user to clarify (disambiguate).

Once enough preferences are provided, the recommendation phase may start. This is done by asking for recommendations (e.g. "What book can I read today?"). During the recommendation phase, the system suggests a set of items, each of which can be rated positively or negatively by the user. A critiquing function also allows the user to criticize some aspects of the suggested item (e.g. "I like this movie, but I don't like Mel Gibson"). It is also possible to ask for more details about the recommended item, for a trailer/preview, or for an explanation (e.g. "Why did you suggest this song?").

Our CoRS uses a modular architecture made up of several components, each with a specific responsibility. It was deployed as a Telegram chatbot, but it can easily be ported to any other messaging platform, such as Facebook Messenger. The components (shown in Figure 1) are:

Figure 1: Architecture of the CoRS

• Dialog Manager: This component is responsible for maintaining a conversation with the user in a persistent way. It decides what action should be performed given the user intent, invokes the other components, aggregates their outputs, and produces the final response.

• Intent Recognizer: This component is responsible for understanding the action that the user is requesting. For example, when the user says "I like Michael Jackson", the preference intent is recognized. The Intent Recognizer is powered by DialogFlow2.

• Entity Recognizer: This component is responsible for recognizing entities mentioned by the user. Given the previous example, it is able to recognize Michael Jackson as an entity mention. It exploits Wikidata3, and does not require any training. This component was developed in-house.

• Sentiment Analyzer: This component is responsible for recognizing the user's sentiment on the recognized entities. Given the previous example, it recognizes a positive rating for Michael Jackson. This component is developed using Stanford CoreNLP4.

• Recommendation Services: This component is responsible for the recommendation algorithm. In particular, we use a Content-Based recommender based on PageRank with priors.

2 https://dialogflow.com/
3 https://www.wikidata.org
4 https://stanfordnlp.github.io/CoreNLP/

4 ConveRSE Datasets

In this section, we describe the main features of the dataset and the process that we used to build it. The dialogues were recorded during an experimental session, in which participants were asked to interact with three CoRSs, one for each domain (movies, books, and music). During the preference acquisition phase, each participant wrote some positive/negative ratings. After that, participants were asked to request a recommendation, and then evaluated five recommended items. Finally, users asked the system to view their profiles. From this experiment, we collected a corpus of messages in the three domains.
In total, there are 5,318 messages for the movie domain, 1,862 for the book domain, and 2,096 for the music domain.

For each message, we collected the user's utterance, the intent recognized by the system, unique IDs for the user and the message, a timestamp, a list of contexts, a list of recognized items, and a set of actions. We chose not to include the system's responses in the dataset, since they are generated via a template. Instead, we report a set of actions that together map the reaction of the system to the user message and the current status of the conversation. For example, the recommendation action means that the user is in the recommendation phase. The question action means that the system responded to the user by asking a question (i.e. requesting a disambiguation, or asking the user to rate a recommended item). Finally, the finished recommendation action signals that the message concludes a recommendation phase. An item is included in the list of recognized items only after it has been correctly disambiguated (if a disambiguation was needed). For example, if the user writes "I like Tom Cruise", the system responds "You said that you like Tom Cruise, can you be more specific? Possible values are: producer, actor". Only when the user answers this question is the item recorded as recognized in the dataset. For each recognized item, we record its Wikidata ID and a symbol that identifies the rating ('+' for positive, '-' for negative).

We applied some heuristics to improve the quality of the data. In particular, the objective is to understand whether the recognized intents and entities are correct. To do this, each conversation was split into tasks, where a task is defined as a sequence of messages with a specific goal. For each task, we observed whether it terminated successfully or an anomaly occurred. Some examples of tasks that are completed correctly are:

• A preference message, followed by one or more disambiguations;
• A recommendation request, followed by one or more preferences on the recommended item, and requests for details and explanations;
• A request for showing the profile.

Some examples of tasks that are not completed correctly are:

• Any task containing a fallback intent (meaning that the intent was not recognized);
• Tasks in which the user asks to skip a disambiguation request, or to stop the recommendation phase;
• Tasks in which an unexpected intent is found (e.g. a preference on an unrelated item during the recommendation phase).

For each message, we added a field called toCheck. This field is set to false if the message is part of a completed task, and true otherwise. In the latter case, it is advisable to manually check the correctness of the intent.

Table 1 describes some statistics extracted from the dataset. More precisely, we report the number of users and messages, the number of preference messages and recommendation requests, the average number of messages per user, the percentage of liked and disliked items (both in the preference acquisition and recommendation phases), the percentage of critiquing, details, preview, and explanation requests (over all recommended items), and the percentage of messages for which toCheck is equal to true. For privacy reasons, we anonymized the dialogues by replacing the original Telegram user ID with a numerical index.

                        Movie    Book     Music
#Users                  149      56       56
#Messages               5318     1862     2096
#Messages per user      35.7     33.3     37.4
#Preference messages    2172     734      1011
#Recomm. requests       456      369      144
%Liked (Preference)     89.8     91.6     93.5
%Disliked (Preference)  10.2     8.40     6.54
%Liked (Recomm.)        77.6     77.7     73.2
%Disliked (Recomm.)     22.4     22.3     26.8
%Critiquing             1.6      0.0      0.42
%Details requests       11.4     3.6      2.08
%Preview requests       6.98     1.7      0.625
%Explanation requests   10.5     1.49     2.5
%To check               39.6     28.8     26.0

Table 1: ConveRSE dataset statistics

4.1 Example of conversation

#   Message                                Intent                                Recognized objects   Status
1   I like the avengers                    preference                                                 question, disambiguation
2   The Avengers (2012)                    preference - disambiguation           Q182218+
3   Suggest some film                      request recommendation                                     recommendation, question
4   I like this movie                      request recommendation - preference   Q14171368+           recommendation, question
5   Why do you suggest this movie?         request recommendation - why                               recommendation, question
6   I love it, but I don't like director   request recommendation - yes but      Q220192+             recommendation, question
7   Can you show my preferences            show profile

Table 2: Short example of conversation in the movie dataset

In this section, we describe a small example of a conversation between a user and the movie-based instance of the CoRS. For each message in Table 2, we describe the utterance along with its main features, in order to make the underlying dialog model more understandable. The following paragraphs contain a short explanation for each message. For brevity, the example is kept compact.
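As a sketch of how the datasets could feed an Intent Recognizer, the toy records below (modeled on the rows of Table 2) are turned into (utterance, intent) training pairs, skipping messages whose toCheck flag is true and whose annotation may therefore be unreliable. The record shape is an assumption that follows the field description in Section 4; the released schema may differ.

```python
# Sketch: building (utterance, intent) training pairs for an
# Intent Recognizer from dataset messages, skipping toCheck ones.
# The record shape is an assumed simplification of the real schema.
messages = [
    {"utterance": "I like the avengers", "intent": "preference", "toCheck": False},
    {"utterance": "Suggest some film", "intent": "request recommendation", "toCheck": False},
    {"utterance": "gibberish input", "intent": "fallback", "toCheck": True},
]

training_pairs = [
    (m["utterance"], m["intent"])
    for m in messages
    if not m["toCheck"]  # keep only messages from completed tasks
]

print(training_pairs)
# → [('I like the avengers', 'preference'), ('Suggest some film', 'request recommendation')]
```

The same filtering could of course be relaxed by manually re-checking the flagged messages, as suggested in Section 4.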


The messages in Table 2 come from different conversations, in order to show more intents with fewer messages.

1. The user has provided a preference during the preference acquisition phase. The recognized intent is therefore preference. Since there are multiple movies matching "The Avengers", further disambiguation is required. This is indicated via the question and disambiguation actions.

2. The user has answered the disambiguation request by specifying that he/she means the movie "The Avengers (2012)". This is associated with the preference - disambiguation intent. Note that only now is the movie included in the recognized objects field.

3. When the user sends this message, a new recommendation phase is started. The corresponding intent is request recommendation. When this happens, the system proposes a movie to be rated by the user. The actions question and recommendation are used to indicate that the CoRS is expecting a rating from the user.

4. When the user provides a rating for a recommended entity (in this case, "I like this movie"), the request recommendation - preference intent is used. The rating of the recommended item is also registered in the recognized objects field. The recommendation and question actions in this case signify that the system responds by presenting another recommended movie to rate.

5. In this case, the user asks for an explanation of the recommended item. The request recommendation - why intent is used. After the explanation is given, the system asks again to rate the movie, as evidenced by the recorded actions.

6. Here, the user provides the rating, but also criticizes the recommendation by adding a negative rating for the director of the recommended movie (referred to earlier as critiquing). The request recommendation - yes but intent is used in this case. Our CoRS requests an additional confirmation when associating a property (i.e. the director) with a recommended item; however, this can be ignored when training a new CoRS.

7. In this case, the user requests to see his/her profile, as indicated by the show profile intent. This can optionally be followed by requests for editing or deleting the profile.

5 Conclusions

In this paper, we presented three datasets that contain real user messages sent to Conversational Recommender Systems in the movie, book, and music domains. The datasets can be used to train a new CoRS to detect intents and, with a few modifications, also to recognize entities and sentiments. The size of the data that we provide may not be sufficient to train deep learning-based End-to-End conversational recommendation models. However, this is outside the scope of our work: as stated in the previous sections, the aim of our datasets is to learn a conversational recommendation dialog model, independently of the actual recommendation algorithm. In any case, we believe that this is the first time that a dataset for training CoRSs in the book and music domains has been released. We also believe that this is a good starting point for the release of further conversational datasets in multiple domains.

As future work, we propose to expand the datasets by collecting more messages in more domains. We will also explore the possibility of using our datasets to evaluate new CoRSs.

References

Layla El Asri, Hannes Schulz, Shikhar Sharma,
  Jeremie Zumer, Justin Harris, Emery Fine, Rahul
  Mehrotra, and Kaheer Suleman. 2017. Frames: A
  Corpus for Adding Memory to Goal-Oriented Di-
  alogue Systems. arXiv:1704.00057 [cs], March.
  arXiv: 1704.00057.

Daniel Braun, Adrian Hernandez-Mendez, Florian
  Matthes, and Manfred Langen. 2017. Evaluating
  Natural Language Understanding Services for Con-
  versational Question Answering Systems. In Pro-
  ceedings of the 18th Annual SIGdial Meeting on Dis-
  course and Dialogue, pages 174–185, Saarbrücken,
  Germany. Association for Computational Linguistics.
Jesse Dodge, Andreea Gane, Xiang Zhang, Antoine
   Bordes, Sumit Chopra, Alexander Miller, Arthur
   Szlam, and Jason Weston. 2015. Evaluating Pre-
   requisite Qualities for Learning End-to-End Dialog
   Systems. arXiv:1511.06931 [cs], November. arXiv:
   1511.06931.
Anthony Jameson, Martijn C. Willemsen, Alexander
  Felfernig, Marco de Gemmis, Pasquale Lops, Gio-
  vanni Semeraro, and Li Chen, 2015. Human De-
  cision Making and Recommender Systems, pages
  611–648. Springer US.

Michael Jugovac and Dietmar Jannach. 2017. Inter-
  acting with Recommenders – Overview and Research
  Directions.
  ligent Systems, 7(3):1–46, September.

Jie Kang, Kyle Condiff, Shuo Chang, Joseph A. Kon-
   stan, Loren Terveen, and F. Maxwell Harper. 2017.
   Understanding How People Use Natural Language
   to Ask for Recommendations. In Proceedings of the
   Eleventh ACM Conference on Recommender Sys-
   tems - RecSys ’17, pages 229–237, Como, Italy.
   ACM Press.
Raymond Li, Samira Kahou, Hannes Schulz, Vincent
  Michalski, Laurent Charlin, and Chris Pal. 2018.
  Towards Deep Conversational Recommendations.
  page 17.
Tariq Mahmood and Francesco Ricci. 2009. Improv-
  ing recommender systems with adaptive conversa-
  tional strategies. In Proceedings of the 20th ACM
  conference on Hypertext and hypermedia - HT ’09,
  page 73, Torino, Italy. ACM Press.
Dimitrios Rafailidis and Yannis Manolopoulos. 2018.
  The Technological Gap Between Virtual Assistants
  and Recommendation Systems. arXiv:1901.00431
  [cs], December. arXiv: 1901.00431.
Francesco Ricci, Lior Rokach, and Bracha Shapira,
  2011. Introduction to recommender systems hand-
  book. Springer US.
Iulian Vlad Serban, Ryan Lowe, Peter Henderson, Lau-
   rent Charlin, and Joelle Pineau. 2015. A Survey
   of Available Corpora for Building Data-Driven Di-
   alogue Systems. arXiv:1512.05742 [cs, stat], De-
   cember. arXiv: 1512.05742.
Alessandro Suglia, Claudio Greco, Pierpaolo Basile,
  Giovanni Semeraro, and Annalina Caputo. 2017.
  An Automatic Procedure for Generating Datasets for
  Conversational Recommender Systems. page 2.