INSPIRED2: An Improved Dataset for Sociable
Conversational Recommendation
Ahtsham Manzoor∗, Dietmar Jannach
University of Klagenfurt, Universitätsstraße 65-67, Klagenfurt am Wörthersee, 9020, Austria


Abstract
Conversational recommender systems (CRS) that are able to interact with users in natural language often utilize recommendation dialogs which were previously collected with the help of paired humans, where one plays the role of a seeker and the other that of a recommender. These recommendation dialogs include items and entities that indicate the users' preferences. In order to precisely model the seekers' preferences and respond consistently, CRS typically rely on item and entity annotations. A recent example of such a dataset is INSPIRED, which consists of recommendation dialogs for sociable conversational recommendation, where items and entities were annotated using automatic keyword or pattern matching techniques. An analysis of this dataset unfortunately revealed a substantial number of cases where items and entities were either wrongly annotated or not annotated at all. This raises the question of how effective automatic annotation techniques are. Moreover, it is important to study the impact of annotation quality on the overall effectiveness of a CRS in terms of the quality of the system's responses. To study these aspects, we manually fixed the annotations in INSPIRED. We then evaluated the performance of several benchmark CRS using both versions of the dataset. Our analyses suggest that the improved version of the dataset, i.e., INSPIRED2, helped increase the performance of several benchmark CRS, emphasizing the importance of data quality both for end-to-end learning and retrieval-based approaches to conversational recommendation. We release our improved dataset (INSPIRED2) publicly at https://github.com/ahtsham58/INSPIRED2.

Keywords
Conversational Recommender Systems, data quality, annotations, evaluation, dialog systems



4th Edition of Knowledge-aware and Conversational Recommender Systems (KaRS) Workshop @ RecSys 2022, September 18–23, 2022, Seattle, WA, USA.
∗ Corresponding author.
Email: ahtsham.manzoor@aau.at (A. Manzoor); dietmar.jannach@aau.at (D. Jannach)
Web: https://ahtsham58.github.io/ (A. Manzoor); https://www.aau.at/en/aics/research-groups/infsys/team/dietmar-jannach/ (D. Jannach)
ORCID: 0000-0001-9418-753 (A. Manzoor); 0000-0002-4698-8507 (D. Jannach)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org)


1. Introduction

Sociable conversational recommender systems (CRS) aim to build rapport with users while interacting with them in natural language [1, 2]. CRS that rely on natural language processing (NLP) nowadays commonly utilize datasets of previously recorded dialogs between humans, where one plays the role of a recommendation seeker and the other that of a human recommender, see e.g., [3]. However, due to a certain lack of rich sociable interactions in such datasets [4], it can be challenging to build a sociable CRS that establishes rapport with users on the basis of such limited data.

Therefore, it is important to develop datasets like INSPIRED [1], which includes dialogs that implement rich social communication strategies. Such rich datasets represent a solid basis for developing trustworthy CRS that are able to engage users in a natural and user-adaptive manner. Another key factor for building high-quality CRS lies in the proper recognition of the named entities and other concepts that appear in the dialogs. In the movies domain, for example, being able to exactly identify the items (i.e., movies) and related entities and concepts (e.g., actors or genres) can play a pivotal role in building an effective system. Existing CRS, for example, arrange such entities and their relationships as graphs [5, 6], and these relationships often form the basis to model the users' preferences, e.g., [7, 8, 9]. Moreover, domain-specific concepts and entities can also contribute to the generation of meaningful and coherent responses, especially in knowledge-aware CRS, see [10, 11, 12, 13].

Annotating items and entities can be a laborious and economically expensive process [14, 15]. Human costs are high and may even be prohibitive for domains where particular knowledge or expertise is required to accomplish the annotation task [16]. In that context, the quality of the resulting annotations is crucial, and factually wrong annotations can lead to errors or ambiguity in the downstream task. Automating the annotation task, or at least automatically verifying the annotations [14], has therefore been a focus of research for several years.

We note here that data quality is crucial both for recent generation-based CRS approaches and for retrieval-based approaches to building natural language conversational systems [17]. For both types of systems, the question arises to what extent better data quality, i.e., having correct annotations and noise-free conversations, leads to better results in terms of the quality of the responses returned by a system for a given user utterance, e.g., in terms of consistency and plausibility.
In this work, we study the recent INSPIRED dataset, in which the items and entities that were mentioned in the recorded utterances are explicitly annotated. These annotations were created with the help of automatic approaches using keyword or pattern matching methods. However, looking at the data, we observed a substantial number of cases where items and entities were either wrongly annotated or not annotated at all, e.g., "My favorite [MOVIE_GENRE_1] are Groundhogs Day, [MOVIE_TITLE_2] and Borat". In addition, there were several cases where the utterances included noise, e.g., "How did you like QUOTATION_MARKHustlersQUOTATION_MARK?". Finally, we found instances where regular words were identified as being named entities. In this latter case, human annotations would in fact have been required.¹ Overall, such issues may limit the quality of any CRS that is built on top of such data.

To understand the severity of the problem and the potential effects of data issues on the quality of a CRS, we have manually corrected the dataset by fixing the annotations and by removing noise from the utterances. Then, we conducted offline experiments and human evaluations to compare the performance of different benchmark CRS when using the original (INSPIRED) and improved (INSPIRED2) datasets. Overall, the results of our analyses indicate that all CRS showed better performance in different dimensions when built on INSPIRED2. In order to facilitate the design and development of future sociable CRS, we release the INSPIRED2 dataset online at https://github.com/ahtsham58/INSPIRED2.

¹ Consider the movie "It (2017)" as an example of a difficult case, e.g., when appearing in an utterance like "Have you seen It?".

2. Related Work

In this section, we first discuss datasets and aspects of data quality in the context of CRS. Afterwards, we review different design paradigms for building CRS, followed by a discussion of predominant evaluation approaches for such systems.

Datasets and Data Quality
Research interest in CRS has experienced substantial growth in recent years, see [18, 19] for related surveys. Many current systems interact with users in natural language, and one important goal for such systems is to enable them to engage in conversations that reflect human behavior. Since many of these recent systems are built on recorded dialogs between humans, the capabilities of the resulting CRS depend on the richness of the communication in the datasets, e.g., in terms of the user intents that can be found in the conversations, see [20] for a detailed analysis of such intents.

A number of new datasets for conversational recommendation were published in recent years, e.g., [3, 21, 22, 23]. Such datasets, which are commonly collected with the help of crowdworkers, can however have limitations and may not be fully representative of what we would observe in reality. In some cases, for example, crowdworkers were instructed to mention a minimum number of movies in the conversations. This leads to mostly "instance-based" conversations, where crowdworkers mention individual movies they like rather than their preferred genres, see also [3, 24, 25].

Another problem when creating such datasets lies in the recognition and annotation of named entities appearing in the conversations, as mentioned above. Annotating entities in textual data can be a tedious process that may require a substantial amount of manual effort and time. To overcome this challenge, researchers sometimes adopt a semi-automatic approach or rely on NLP-assisted tools that visualize the entities in a text in order to reduce the required manual effort [16, 15, 26]. Generally, automatic approaches may fail to create correct annotations in cases where human judgments and opinions are required. An automated approach was used in the context of the INSPIRED dataset: the items and entities were annotated using keyword or pattern matching approaches. However, verifying the outcomes of such automatic or semi-automatic approaches can again be laborious and require manual effort.

Today, structured annotations for items and entities mentioned in the conversations are common in recent datasets. For example, in the case of the ReDial dataset [3], the mentioned movie titles were annotated with unique IDs. However, the ReDial dataset has some limitations. Various meta-data concepts (e.g., genres, actors, or directors) were not annotated. Moreover, the recorded dialogs include limited social interactions or explanations for the recommendations that were made. The INSPIRED dataset, on the other hand, includes rich sociable conversation and explanation strategies for the recommended items, and aspects like movie genres or actors were explicitly annotated as well. A comparison of these differences can be found in [1]. The key statistics of the INSPIRED dataset are shown in Table 1.

As mentioned earlier, the INSPIRED dataset has some limitations. The keyword or pattern matching approach used for the annotations might, for example, not detect misspelled keywords or concepts in an utterance. Moreover, data anomalies such as noisy utterances or ill-formed language can deteriorate the performance of an annotation algorithm, leading to challenges for the downstream use of the dataset [15, 27, 28]. In reality, the level of noise can be substantial both in real-world applications and in purposefully created datasets. Therefore, data quality assurance is often considered an important step in NLP applications.
Table 1
Main Statistics of INSPIRED

                                            Total
  Number of dialogs (conversations)         1,001
  Average turns per dialog                  10.73
  Average tokens per utterance               7.93
  Number of human-recommender utterances   18,339
  Number of seeker utterances              17,472

Building Conversational Recommender Systems
Research on CRS has made substantial progress in terms of the underlying technical approaches. Some early commercial systems such as Advisor Suite [29], for example, relied on an entirely knowledge-based approach for the development of adaptive and personalized applications. Similarly, early critiquing-based systems were based on detailed knowledge about item features and possible critiques and had limited learning capabilities [30, 31].

Technological advancements, particularly in fields like NLP, speech recognition, and machine learning in general, led to the design of today's end-to-end learning-based CRS. In such approaches, recorded recommendation dialogs between paired humans are used to train deep neural models, see, e.g., [8, 9, 10, 12]. Given the last user utterance and the history of the ongoing dialog, these trained models are then used to generate responses in natural language. These responses can either include item recommendations, which are also computed with the help of machine learning techniques, or other types of conversational elements, e.g., greetings.

In terms of the underlying data, the DeepCRS [3] system was built on the ReDial dataset, which was created in the context of that work. Later on, systems were developed which also relied on this dataset but included additional information sources, e.g., from DBpedia or ConceptNet [32, 12], to build knowledge graphs that are then used to improve the generated utterances. A number of works also make use of pretrained language models like BERT [33] and subsequently fine-tune them using the recommendation dialogs, see, e.g., [34]. A related approach was adopted by the authors of INSPIRED, who proposed two variants of a conversational system, with and without strategy labels.

Unlike generation-based systems, retrieval-based CRS aim to retrieve and adapt suitable responses from the dataset of recorded dialogs. One main advantage of retrieval-based approaches is that the retrieved responses were genuinely made by humans and are thus usually grammatically correct and in themselves semantically meaningful [35]. Recent examples of such retrieval-based systems are RB-CRS [17] and CRB-CRS [36], which we designed and evaluated based on the ReDial dataset in our own previous work.

CRS Evaluation
Evaluating a CRS is a multi-faceted and challenging problem as it requires the consideration of various quality dimensions. An in-depth discussion of evaluation approaches for CRS can be found in [37]. As in the recommender systems literature in general, computational experiments that do not involve humans in the loop are the predominant instrument to assess the quality of a CRS. Common metrics to evaluate the quality of the recommendations include Recall, Hit Rate, or Precision [8, 21, 38]. Moreover, certain linguistic aspects such as fluency or diversity are often evaluated with offline experiments as well to assess the quality of the generated responses. Common metrics in this area include Perplexity, distinct N-grams, or the BLEU score [3, 8, 13, 22].

Given the interactive nature of CRS, offline experiments and the corresponding metrics have their limitations. Mainly, it is not always clear if the results obtained from offline experiments are representative of the user-perceived quality of the recommendations or system responses in general [35]. For example, when using metrics like the BLEU score, a system response is usually compared with one particular given ground truth. Such a comparison has limitations when used to estimate the average quality of a system's responses, because there might be many different alternative responses that would be suitable as well in an ongoing dialog. Still, offline evaluations have their place and value. They can, for example, be informative for assessing particular aspects such as the number of items or entities that appear in an utterance or conversation.

Overall, given the limitations of pure offline experiments, researchers often follow a mixed approach where some aspects of the system are evaluated offline and some with humans. Typical quality aspects in terms of human perceptions in such combined approaches include the assessment of the meaningfulness or consistency of the system responses [1, 8, 12, 13, 36].

3. Data Annotation Methodology

During the creation of the INSPIRED [1] dataset, items and other entities were annotated in an automated way, as described above. For example, genre keywords were annotated using a regular expression to match a set of predefined tokens. Regarding actors, directors, and other entities, a pattern matching technique was used, where words starting with a capital letter were searched in the TMDB database². A similar technique was used for movie titles. However, as mentioned, we observed a large number of cases where items and entities were either wrongly annotated or not annotated. To answer our research question on the impact of the quality of the underlying data on the quality of the responses of a CRS, we fixed the annotations as follows.

² https://www.themoviedb.org
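To make the described annotation strategy concrete, the following is a minimal, illustrative Python sketch of keyword-based genre matching combined with capitalized-token lookup. It is not the authors' original script: the genre list, the KNOWN_TITLES stand-in for a TMDB lookup, and the 1-based placeholder indexing are assumptions for illustration only.

```python
import re

# Illustrative genre keywords (assumption; the authors used a curated set of 27).
GENRES = ["comedy", "horror", "thriller", "drama", "romance", "documentary"]
GENRE_RE = re.compile(r"\b(" + "|".join(GENRES) + r")\b", re.IGNORECASE)

# Hypothetical stand-in for looking up capitalized candidates in a movie database.
KNOWN_TITLES = {"Hustlers", "Groundhog Day", "Borat"}

def annotate(utterance: str) -> str:
    """Replace matched genre keywords and known titles with indexed placeholders."""
    counters = {"MOVIE_GENRE": 0, "MOVIE_TITLE": 0}

    def placeholder(kind: str) -> str:
        counters[kind] += 1
        return f"[{kind}_{counters[kind]}]"

    # 1) Keyword/regex matching for genres.
    utterance = GENRE_RE.sub(lambda _: placeholder("MOVIE_GENRE"), utterance)

    # 2) Pattern matching: capitalized word sequences checked against the title set.
    for candidate in re.findall(r"[A-Z][\w']*(?:\s+[A-Z][\w']*)*", utterance):
        if candidate in KNOWN_TITLES:
            utterance = utterance.replace(candidate, placeholder("MOVIE_TITLE"), 1)

    return utterance

print(annotate("My favorite comedy movies are Groundhog Day and Borat"))
# -> "My favorite [MOVIE_GENRE_1] movies are [MOVIE_TITLE_1] and [MOVIE_TITLE_2]"
```

As the sketch makes obvious, such exact matching breaks down as soon as a title is misspelled or abbreviated, which is precisely the kind of failure discussed in the remainder of this section.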
Procedure
To fix the annotations, we interviewed a number of university students to assess their knowledge of the movies domain and their ability to do the correction task. Subsequently, we hired two students and instructed them on how to annotate and clean the dataset. First, they were briefed on the logical format of the original annotations and how to retain that format. Second, they were asked to read each utterance individually, to detect potential noise, and to analyze which items or entities (e.g., title, genre, actor, or director) are mentioned in it.

In case of ambiguity or obscurity, they were allowed to access online portals, e.g., IMDb³. Note that regarding the genres, a set of 27 keywords was provided to them, which we curated and used in our earlier research [36]. After the briefing, the dataset was split evenly between the two annotators. On a weekly basis, their performance and the accuracy of the annotations were checked by one of the authors. Finally, after annotating the complete dataset, a number of additional validation steps were applied.

First, using a Python script, we ensured that every placeholder is enclosed by '[' and ']' as was done originally, e.g., [MOVIE_TITLE_1]. Second, another thorough manual examination of the entire improved dataset was performed to fix any missing annotations or noise. In that context, we also double-checked the consistency of the format and of the annotations.

³ https://www.imdb.com/
The INSPIRED2 Dataset
In total, 1,851 new annotations were added to INSPIRED, leading to the INSPIRED2 dataset. Most mistakes or inconsistencies were found for the items, i.e., movie titles, which are the most pertinent information for developing a CRS. We present the statistics about the new annotations in Table 2. Overall, we added around 20% new annotations in INSPIRED2. The number of issues that were fixed, e.g., duplicate annotations in an utterance, noise, or factually wrong information in the original annotations, is not shown in the presented statistics. We release INSPIRED2 both in TSV and JSON format online.

Table 2
Statistics about new annotations added in INSPIRED2

                                       Total    % Increase
  Number of movie titles                 966          22.0
  Number of movie genres                 206           5.0
  Number of actors, directors, etc.      519          49.0
  Number of movie plots                  160          54.6
  Number of new annotations            1,851          18.9

Observed Issues
During the annotation process, we recorded the observed issues in the original annotations. Since the original annotations were created using automatic techniques, many issues were related to the limitations of the simple keyword or pattern matching techniques. Overall, we observed a number of cases where minor spelling mistakes or incomplete movie titles made the exact string matching approaches ineffective.

For example, in one of the utterances, "I think I am waiting for Star Wars The Rise of Skywalker", the annotation was missing because the correct title is "Star Wars: Episode IX – The Rise of Skywalker". Similarly, we observed a significant number of cases where an utterance was only partially annotated, e.g., "ok is it scary like incidious or [MOVIE_GENRE_2] [MOVIE_TITLE_5]". In addition, in places where two entities were separated by '/' instead of a space, the automatic technique often failed to create proper annotations, e.g., "Since you like [MOVIE_GENRE_1] drama/mystery, I'm going to send you the trailer to the movie [MOVIE_TITLE_3]".

Also, the automatic approach used for INSPIRED sometimes had difficulties dealing with ambiguity. We found a number of cases where a regular word was annotated although it did not belong to any item or entity. For example, in one of the cases, "Are you interested in a current movie in the box office?", the utterance was annotated as "Are you interested in a current movie in the box [MOVIE_TITLE_0]", where the word 'office' was mistakenly annotated as an item, i.e., The Office (2005).

Overall, the main observed issues are the following.

   1. Missing annotations for movie titles, genres, actors, movie plots, etc.
   2. Partially annotated items and entities such as movie titles or genres in an utterance.
   3. Factually wrong annotations for movie titles.
   4. Inconsistent indexing for the annotated items and entities.
   5. Mistaken annotations for plain text, e.g., family, box office; human annotations may be required here.
   6. Parts of the utterance or a few keywords that were omitted during the annotation process.

4. Evaluation Methodology

We performed both offline experiments as well as a human evaluation to assess the impact of data quality on the quality of the responses of a CRS.
Offline Evaluation of Recommendation Quality
We included the following recent end-to-end learning approaches in our experiments: DeepCRS [3], KGSF [12], TG-ReDial [22], and the INSPIRED model without strategy labels⁴ [1]. This selection of models covers various design approaches for CRS, e.g., using an additional knowledge graph or not. We used the open-source toolkit CRSLab⁵ for our evaluations. This framework was used in earlier research as well, for example in [10, 39, 40]. For our analyses, we first trained the aforementioned CRS models using the original split ratio, i.e., 8:1:1, for each dataset. Afterwards, given the trained models and the test data for each dataset, we ran three trials for each CRS and subsequently averaged the results for the offline evaluation metrics. Note that the same procedure was applied to both versions of the dataset, i.e., INSPIRED and INSPIRED2.

⁴ The INSPIRED model with strategy labels was not publicly available.
⁵ https://github.com/RUCAIBox/CRSLab
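For reference, the ranking metrics reported later in Table 3 (Hit@k, MRR@k, NDCG@k) can be computed per test turn as in the following minimal sketch; this is a generic illustration with a single ground-truth item per turn, not the CRSLab implementation.

```python
import math

def hit_at_k(ranked_items, target, k):
    """1 if the ground-truth item appears in the top-k recommendations, else 0."""
    return int(target in ranked_items[:k])

def mrr_at_k(ranked_items, target, k):
    """Reciprocal rank of the ground-truth item within the top-k, else 0."""
    for rank, item in enumerate(ranked_items[:k], start=1):
        if item == target:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranked_items, target, k):
    """NDCG with a single relevant item: 1/log2(rank+1) if it is in the top-k."""
    for rank, item in enumerate(ranked_items[:k], start=1):
        if item == target:
            return 1.0 / math.log2(rank + 1)
    return 0.0

# Example: one recommendation list and one ground-truth item; per-turn scores are
# averaged over the test set (and, in our setup, over the three trials).
ranked = ["m_42", "m_7", "m_13"]
print(hit_at_k(ranked, "m_7", 10), mrr_at_k(ranked, "m_7", 10), ndcg_at_k(ranked, "m_7", 10))
# -> 1 0.5 0.6309297535714575
```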
                                                                        CRS that are guided by a topic policy, and (iv) CRS that
User Study on Linguistic Quality
We conducted a user study to compare the perceived quality of system responses when using either INSPIRED or INSPIRED2. Specifically, we randomly sampled the same 50 dialog situations from each dataset. To create the dialog continuations, we used the retrieval-based CRS approaches RB-CRS and CRB-CRS, which we proposed in our earlier work, see [36].

In order to obtain fine-grained assessments, three human judges⁶ were involved. The specific task of the judges was to assess (rate) the meaningfulness of a system response as a proxy of its quality and consistency in a dialog situation, see [3, 41, 12]. Note that in this study we did not explicitly assess the quality of the specific item recommendations. Instead, the focus of this study was to understand the impact of the improved underlying dataset on the linguistic quality and the consistency of the generated responses.

We used a 3-point scale for these ratings, from 'Completely meaningless (1)' to 'Somewhat meaningless and meaningful (2)' to 'Completely meaningful (3)'. The human judges were provided with specific instructions on how to evaluate the meaningfulness of a response, e.g., they should assess if a response represents a logical dialog continuation and evaluate the overall language quality of the given response. Overall, the human judges were provided with 50 dialogs (446 responses to rate) that were produced using the INSPIRED and INSPIRED2 datasets. We also explained the meaning and purpose of the various placeholders contained in the responses to the human judges. Moreover, to avoid any bias in the evaluation process, the judges were not made aware of which response was created for which dataset or by which CRS. Also, the order of the dialogs and the system responses was randomized.

⁶ These judges were PhD students and were different from the ones who fixed the annotations.

5. Results

Recommendation Quality
Table 3 shows the accuracy results for the evaluated CRS models. Specifically, we provide the results for the different benchmark CRS models in terms of the performance difference when using the original and improved annotations. Overall, we can observe an almost consistent gain in performance for all models and on all metrics except Hit@50 when the improved dataset is used. The obtained improvements can be quite substantial, indicating that improved data quality can be helpful for CRS of different types, including (i) CRS that do not rely on additional knowledge sources, (ii) CRS that leverage additional knowledge sources, (iii) CRS that are guided by a topic policy, and (iv) CRS that rely on pre-trained language models like BERT.

Interestingly, we see negative effects for two measurements in which Hit@50 is used as a metric. A deeper investigation of this phenomenon is needed, in particular as the other metrics at this (admittedly rather uncommon) list length, MRR@50 and NDCG@50, indicate that the improved dataset is helpful to increase recommendation accuracy. At the moment, we can only speculate that the improved annotations in the ongoing dialog histories led to more diverse or niche recommendations compared to the original dataset. We might assume that the missing annotations in many cases referred to less popular movies, so that the recommendations without the improved annotations more often include popular movies, which is commonly advantageous in terms of hit rate and recall.
Table 3
Accuracy results obtained in the offline evaluation. V1 represents INSPIRED, V2 denotes INSPIRED2, and "% Change" represents the actual performance gain/loss when using INSPIRED2 compared to INSPIRED.

                          Hit@1    Hit@10   Hit@50    MRR@1   MRR@10   MRR@50   NDCG@1  NDCG@10  NDCG@50
 DeepCRS    V1           0.0006   0.0464   0.1726   0.0065   0.0148   0.0193   0.0065   0.0220   0.0478
 [3]        V2           0.0256   0.0578   0.1222   0.0256   0.0306   0.0333   0.0256   0.0366   0.0504
            % Change    4161.11    24.50   -29.20   294.95   106.99    72.48   294.95    66.08     5.31
 KGSF       V1           0.0022   0.0216   0.0744   0.0032   0.0061   0.0084   0.0022   0.0097   0.0211
 [12]       V2           0.0066   0.0303   0.0587   0.0057   0.0123   0.0134   0.0066   0.0165   0.0223
            % Change     207.27    40.46   -21.11    75.58   100.97    58.40   207.27    70.36     5.78
 TG-ReDial  V1           0.0365   0.1149   0.2344   0.0365   0.0572   0.0626   0.0365   0.0707   0.0967
 [22]       V2           0.0511   0.1315   0.2417   0.0511   0.0742   0.0792   0.0511   0.0877   0.1118
            % Change      40.00    14.46     3.12    40.00    29.64    26.48    40.00    24.05    15.51
 INSPIRED   V1           0.0151   0.0550   0.1532   0.0151   0.0241   0.0286   0.0151   0.0312   0.0527
 [1]        V2           0.0194   0.0734   0.1855   0.0194   0.0293   0.0353   0.0194   0.0392   0.0650
            % Change      28.57    33.33    21.13    28.57    21.59    23.44    28.57    25.44    23.28
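The "% Change" column can be read as the relative difference between the two scores; a small sketch of this computation is given below (recomputing from the rounded table entries may differ in the last digits from the reported values, which were presumably derived from unrounded scores).

```python
def percent_change(v1: float, v2: float) -> float:
    """Relative gain/loss of the V2 score over the V1 score, in percent."""
    return (v2 - v1) / v1 * 100

# Example: TG-ReDial, Hit@10 (Table 3).
print(round(percent_change(0.1149, 0.1315), 2))  # -> 14.45 (reported: 14.46)
```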




Linguistic Quality
We recall that three human evaluators assessed the linguistic quality of the system responses (dialog continuations), which were created based on either the INSPIRED or the INSPIRED2 dataset. As underlying CRS systems, we considered the retrieval-based approaches RB-CRS and CRB-CRS, as mentioned above. For our analysis, we averaged the scores of the three evaluators. Table 4 shows the mean ratings across all dialog situations as well as the standard deviations. We find that also in the case of retrieval-based approaches, improving the quality of the underlying dataset was helpful, leading to higher mean scores without larger standard deviations. A Student's t-test reveals that the observed differences in the means are statistically significant (p<0.001).⁷

⁷ We provide the data and compiled results of our study online.

Table 4
Results of Human Evaluation

                              INSPIRED    INSPIRED2
  RB-CRS    Average score         2.30         2.46
            Std. deviation        0.62         0.59
  CRB-CRS   Average score         2.31         2.46
            Std. deviation        0.55         0.55
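A minimal sketch of the kind of significance test reported above is shown next. It assumes the per-response ratings for the two dataset versions are available as two arrays and uses an independent two-sample Student's t-test via SciPy; the sample values are placeholders, and whether the authors used a paired or unpaired variant is not specified in the paper.

```python
from scipy import stats

# Hypothetical per-response meaningfulness ratings (averaged over the three judges),
# one list per dataset version; in the study there were 446 rated responses in total.
ratings_inspired = [2.3, 2.0, 2.7, 1.7, 2.3]    # placeholder values
ratings_inspired2 = [2.7, 2.3, 3.0, 2.0, 2.7]   # placeholder values

# Independent two-sample Student's t-test (equal variances assumed).
t_stat, p_value = stats.ttest_ind(ratings_inspired, ratings_inspired2)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
```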

Comparison of Knowledge Concepts in Responses
To understand the impact of the new annotations on the responses in terms of the richness of knowledge concepts, we computed the number of items and entities that appeared in the system responses. Specifically, we counted the number of placeholders in the responses before they would be replaced by the recommendation component, see [36]. In Table 5, we present the statistics for RB-CRS and CRB-CRS for both dataset versions. Overall, we find that the responses for the improved dataset contain between 20% and 27% more concepts and entities. We note that an increase in concepts is expected, as INSPIRED2 has almost 20% more annotations. However, the important observation here is that the retrieval-based CRS approaches actually surfaced these richer system responses frequently.

Table 5
Number of Items and Entities included in Responses

              INSPIRED    INSPIRED2    % Increase
  RB-CRS           174          222          27.6
  CRB-CRS          208          251          20.7
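A minimal sketch of this placeholder counting is given below; the placeholder pattern mirrors the examples shown earlier, and the sample responses are invented for illustration.

```python
import re

# Matches indexed placeholders such as [MOVIE_TITLE_3] or [MOVIE_GENRE_1].
PLACEHOLDER = re.compile(r"\[[A-Z_]+_\d+\]")

def count_concepts(responses):
    """Total number of item/entity placeholders across a list of system responses."""
    return sum(len(PLACEHOLDER.findall(r)) for r in responses)

responses_v1 = ["I recommend [MOVIE_TITLE_1].", "Do you like [MOVIE_GENRE_1]?"]
responses_v2 = ["I recommend [MOVIE_TITLE_1] or [MOVIE_TITLE_2].", "Do you like [MOVIE_GENRE_1]?"]
print(count_concepts(responses_v1), count_concepts(responses_v2))  # -> 2 3
```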
                                                                  across different technical approaches for building CRS.
BLEU Score Analysis
Finally, in order to understand to what extent (offline) linguistic scores correlate with the perceived quality of responses, as was done in [1, 8], we performed an analysis of the BLEU scores obtained for the different datasets. Specifically, given a system response and the corresponding ground-truth response, we preprocess both sentences and compute the BLEU scores for N = {1, 2, 3, 4} grams. We provide the results of this analysis online. In sum, the analysis shows that the BLEU scores generally improve when the underlying data quality is higher, i.e., in the case of the INSPIRED2 dataset. These findings are thus well aligned with the outcomes of our human evaluation study, where using INSPIRED2 as an underlying dataset turned out to be favorable.
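A minimal sketch of such a BLEU computation with NLTK is shown below. The whitespace tokenization and lowercasing stand in for the preprocessing step, which the paper does not detail, and the per-order weighting (one score per n-gram order) is one common reading of "BLEU for N = {1, 2, 3, 4} grams".

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def bleu_n(reference: str, hypothesis: str, n: int) -> float:
    """BLEU score considering only n-grams of a single order n."""
    ref_tokens = reference.lower().split()
    hyp_tokens = hypothesis.lower().split()
    weights = tuple(1.0 if i == n - 1 else 0.0 for i in range(4))
    return sentence_bleu([ref_tokens], hyp_tokens, weights=weights,
                         smoothing_function=SmoothingFunction().method1)

ground_truth = "I think you would enjoy that movie"
system_response = "I think you will enjoy this movie"
for n in (1, 2, 3, 4):
    print(f"BLEU-{n}: {bleu_n(ground_truth, system_response, n):.3f}")
```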



6. Conclusion

Datasets containing recorded dialogs between humans are the basis for many modern CRS. In this work, we have analyzed the recent INSPIRED dataset, which was developed to build the next generation of sociable CRS. We found that automatic entity and concept labeling has its limitations, and we have improved the quality of the dataset through a manual process. We then conducted both computational experiments and experiments with users to analyze to what extent improved data quality impacts recommendation accuracy and the quality perception of the system's responses by users. The analyses clearly indicate the benefits of improved data quality across different technical approaches for building CRS. We release the improved dataset publicly and hope to thereby stimulate more research on sociable conversational recommender systems in the future.
References

 [1] S. A. Hayati, D. Kang, Q. Zhu, W. Shi, Z. Yu, INSPIRED: Toward sociable recommendation dialog systems, in: EMNLP '20, 2020.
 [2] F. Pecune, L. Callebert, S. Marsella, A socially-aware conversational recommender system for personalized recipe recommendations, in: Proceedings of the 8th International Conference on Human-Agent Interaction, HAI '20, 2020, pp. 78–86.
 [3] R. Li, S. E. Kahou, H. Schulz, V. Michalski, L. Charlin, C. Pal, Towards deep conversational recommendations, in: NIPS '18, 2018, pp. 9725–9735.
 [4] D. Jannach, L. Chen, Conversational recommendation: A grand AI challenge, AI Magazine 43 (2022).
 [5] M. Di Bratto, M. Di Maro, A. Origlia, F. Cutugno, Dialogue analysis with graph databases: Characterising domain items usage for movie recommendations (2021).
 [6] C.-M. Wong, F. Feng, W. Zhang, C.-M. Vong, H. Chen, Y. Zhang, P. He, H. Chen, K. Zhao, H. Chen, Improving conversational recommender system by pretraining billion-scale knowledge graph, in: ICDE '21, 2021, pp. 2607–2612.
 [7] Y. Cao, X. Wang, X. He, Z. Hu, T.-S. Chua, Unifying knowledge graph learning and recommendation: Towards a better understanding of user preferences, in: WWW '19, 2019, pp. 151–161.
 [8] Q. Chen, J. Lin, Y. Zhang, M. Ding, Y. Cen, H. Yang, J. Tang, Towards knowledge-based recommender dialog system, in: EMNLP-IJCNLP '19, 2019, pp. 1803–1813.
 [9] J. Zhou, B. Wang, R. He, Y. Hou, CRFR: Improving conversational recommender systems via flexible fragments reasoning on knowledge graphs, in: EMNLP '21, 2021, pp. 4324–4334.
[10] K. Chen, S. Sun, Knowledge-based conversational recommender systems enhanced by dialogue policy learning, in: IJCKG '21, 2021, pp. 10–18.
[11] A. Wang, C. D. V. Hoang, M.-Y. Kan, Perspectives on crowdsourcing annotations for natural language processing, Language Resources and Evaluation 47 (2013) 9–31.
[12] K. Zhou, W. X. Zhao, S. Bian, Y. Zhou, J.-R. Wen, J. Yu, Improving conversational recommender systems via knowledge graph based semantic fusion, in: KDD '20, 2020, pp. 1006–1014.
[13] Y. He, L. Liao, Z. Zhang, T.-S. Chua, Towards enriching responses with crowd-sourced knowledge for task-oriented dialogue, in: MuCAI '21, 2021, pp. 3–11.
[14] T. Arjannikov, C. Sanden, J. Z. Zhang, Verifying tag annotations through association analysis, in: ISMIR '13, 2013, pp. 195–200.
[15] J. S. Grosman, P. H. Furtado, A. M. Rodrigues, G. G. Schardong, S. D. Barbosa, H. C. Lopes, ERAS: Improving the quality control in the annotation process for natural language processing tasks, Information Systems 93 (2020) 101553.
[16] B. C. Benato, J. F. Gomes, A. C. Telea, A. X. Falcão, Semi-automatic data annotation guided by feature space projection, Pattern Recognition 109 (2021) 107612.
[17] A. Manzoor, D. Jannach, Generation-based vs. retrieval-based conversational recommendation: A user-centric comparison, in: RecSys '21, 2021.
[18] D. Jannach, A. Manzoor, W. Cai, L. Chen, A survey on conversational recommender systems, ACM Computing Surveys 54 (2021) 1–36.
[19] C. Gao, W. Lei, X. He, M. de Rijke, T.-S. Chua, Advances and challenges in conversational recommender systems: A survey, AI Open 2 (2021) 100–126.
[20] W. Cai, L. Chen, Predicting user intents and satisfaction with dialogue-based conversational recommendations, in: UMAP '20, 2020, pp. 33–42.
[21] D. Kang, A. Balakrishnan, P. Shah, P. Crook, Y.-L. Boureau, J. Weston, Recommendation as a communication game: Self-supervised bot-play for goal-oriented dialogue, in: EMNLP-IJCNLP '19, 2019, pp. 1951–1961.
[22] K. Zhou, Y. Zhou, W. X. Zhao, X. Wang, J.-R. Wen, Towards topic-guided conversational recommender system, in: ICCL '20, 2020, pp. 4128–4139.
[23] Z. Fu, Y. Xian, Y. Zhu, Y. Zhang, G. de Melo, COOKIE: A dataset for conversational recommendation over knowledge graphs in e-commerce, 2020. arXiv:2008.09237.
[24] K. Christakopoulou, F. Radlinski, K. Hofmann, Towards conversational recommender systems, in: KDD '16, 2016, pp. 815–824.
[25] X. Ren, H. Yin, T. Chen, H. Wang, Z. Huang, K. Zheng, Learning to ask appropriate questions in conversational recommendation, in: SIGIR '21, 2021, pp. 808–817.
[26] P. Stenetorp, S. Pyysalo, G. Topić, T. Ohta, S. Ananiadou, J. Tsujii, brat: A web-based tool for NLP-assisted text annotation, in: ACL '12, 2012, pp. 102–107.
[27] C. Zong, R. Xia, J. Zhang, Data Annotation and Preprocessing, 2021, pp. 15–31.
[28] P. Röttger, B. Vidgen, D. Hovy, J. B. Pierrehumbert, Two contrasting data annotation paradigms for subjective NLP tasks, 2021. arXiv:2112.07475.
[29] D. Jannach, ADVISOR SUITE – A knowledge-based sales advisory system, in: ECAI '04, 2004, pp. 720–724.
[30] K. McCarthy, Y. Salem, B. Smyth, Experience-based critiquing: Reusing critiquing experiences to improve conversational recommendation, in: ICCBR '10, 2010, pp. 480–494.
[31] L. Chen, P. Pu, Critiquing-based recommenders: Survey and emerging trends, User Modeling and User-Adapted Interaction 22 (2012) 125–150.
[32] Q. Chen, J. Lin, Y. Zhang, H. Yang, J. Zhou, J. Tang, Towards knowledge-based personalized product description generation in e-commerce, in: KDD '19, 2019, pp. 3040–3050.
[33] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in: NAACL-HLT, 2019.
[34] L. Wang, H. Hu, L. Sha, C. Xu, K.-F. Wong, D. Jiang, Finetuning large-scale pre-trained language models for conversational recommendation with knowledge graph, 2021. arXiv:2110.07477.
[35] A. Manzoor, D. Jannach, Conversational recommendation based on end-to-end learning: How far are we?, Computers in Human Behavior Reports (2021) 100139.
[36] A. Manzoor, D. Jannach, Towards retrieval-based conversational recommendation, Information Systems (2022) 102083.
[37] D. Jannach, Evaluating conversational recommender systems, Artificial Intelligence Review, forthcoming (2022).
[38] T. Zhang, Y. Liu, P. Zhong, C. Zhang, H. Wang, C. Miao, KECRS: Towards knowledge-enriched conversational recommendation system, 2021. arXiv:2105.08261.
[39] Y. Zhou, K. Zhou, W. X. Zhao, C. Wang, P. Jiang, H. Hu, C²-CRS: Coarse-to-fine contrastive learning for conversational recommender system, in: WSDM '22, 2022, pp. 1488–1496.
[40] Y. Li, B. Peng, Y. Shen, Y. Mao, L. Liden, Z. Yu, J. Gao, Knowledge-grounded dialogue generation with a unified knowledge representation, 2021. arXiv:2112.07924.
[41] D. Jannach, A. Manzoor, End-to-end learning for conversational recommendation: A long way to go?, in: IntRS Workshop at RecSys '20, Online, 2020.