INSPIRED2: An Improved Dataset for Sociable
Conversational Recommendation
Ahtsham Manzoor∗, Dietmar Jannach
University of Klagenfurt, Universitätsstraße 65-67, Klagenfurt am Wörthersee, 9020, Austria


Abstract
Conversational recommender systems (CRS) that are able to interact with users in natural language often utilize recommendation dialogs which were previously collected with the help of paired humans, where one plays the role of a seeker and the other that of a recommender. These recommendation dialogs include items and entities that indicate the users' preferences. In order to precisely model the seekers' preferences and respond consistently, CRS typically rely on item and entity annotations. A recent example of such a dataset is INSPIRED, which consists of recommendation dialogs for sociable conversational recommendation, where items and entities were annotated using automatic keyword or pattern matching techniques. An analysis of this dataset unfortunately revealed a substantial number of cases where items and entities were either wrongly annotated or not annotated at all. This raises the question of how effective automatic annotation techniques are. Moreover, it is important to study the impact of annotation quality on the overall effectiveness of a CRS in terms of the quality of the system's responses. To study these aspects, we manually fixed the annotations in INSPIRED. We then evaluated the performance of several benchmark CRS using both versions of the dataset. Our analyses suggest that the improved version of the dataset, i.e., INSPIRED2, helped increase the performance of several benchmark CRS, emphasizing the importance of data quality both for end-to-end learning and retrieval-based approaches to conversational recommendation. We release our improved dataset (INSPIRED2) publicly at https://github.com/ahtsham58/INSPIRED2.

Keywords
Conversational Recommender Systems, data quality, annotations, evaluation, dialog systems



4th Edition of Knowledge-aware and Conversational Recommender Systems (KaRS) Workshop @ RecSys 2022, September 18–23, 2022, Seattle, WA, USA.
∗ Corresponding author.
Email: ahtsham.manzoor@aau.at (A. Manzoor); dietmar.jannach@aau.at (D. Jannach)
Web: https://ahtsham58.github.io/ (A. Manzoor); https://www.aau.at/en/aics/research-groups/infsys/team/dietmar-jannach/ (D. Jannach)
ORCID: 0000-0001-9418-753 (A. Manzoor); 0000-0002-4698-8507 (D. Jannach)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org)


1. Introduction

Sociable conversational recommender systems (CRS) aim to build rapport with users while interacting with them in natural language [1, 2]. CRS that rely on natural language processing (NLP) nowadays commonly utilize datasets of previously recorded dialogs between humans, where one plays the role of a recommendation seeker and the other that of a human recommender, see e.g., [3]. However, due to a certain lack of rich sociable interactions in such datasets [4], it can be challenging to build a sociable CRS that establishes rapport with users on the basis of such limited data.

Therefore, it is important to develop datasets like INSPIRED [1], which includes dialogs that implement rich social communication strategies. Such rich datasets represent a solid basis for developing trustworthy CRS that are able to engage users in a natural and user-adaptive manner. Another key factor for building high-quality CRS lies in the proper recognition of the named entities and other concepts that appear in the dialogs. In the movies domain, for example, being able to exactly identify the items (i.e., movies) and related entities and concepts (e.g., actors or genres) can play a pivotal role in building an effective system. Existing CRS, for example, arrange such entities and their relationships as graphs [5, 6], and these relationships often form the basis to model the users' preferences, e.g., [7, 8, 9]. Moreover, domain-specific concepts and entities can also contribute to the generation of meaningful and coherent responses, especially in knowledge-aware CRS, see [10, 11, 12, 13].

Annotating items and entities can be a laborious and economically expensive process [14, 15]. Human costs are high and may even be prohibitive for domains where particular knowledge or expertise is required to accomplish the annotation task [16]. In that context, the quality of the resulting annotations is crucial, and factually wrong annotations can lead to errors or ambiguity in the downstream task. Automating the annotation task, or at least automatically verifying the annotations [14], has therefore been a focus of research for several years.

We note here that data quality is crucial both for recent generation-based CRS approaches and for retrieval-based approaches to building natural language conversational systems [17]. For both types of systems, the question arises to what extent better data quality, i.e., having correct annotations and noise-free conversations, leads to better results in terms of the quality of the responses returned by a system for a given user utterance, e.g., in terms of consistency and plausibility.
In this work, we study the recent INSPIRED dataset, in which the items and entities that were mentioned in the recorded utterances are explicitly annotated. These annotations were created with the help of automatic approaches using keyword or pattern matching methods. However, looking at the data, we observed a substantial number of cases where items and entities were either wrongly annotated or not annotated at all, e.g., "My favorite [MOVIE_GENRE_1] are Groundhogs Day, [MOVIE_TITLE_2] and Borat". In addition, there were several cases where the utterances included noise, e.g., "How did you like QUOTATION_MARKHustlersQUOTATION_MARK?". Finally, we found instances where regular words were identified as being named entities. In this latter case, human annotations would in fact have been required.¹ Overall, such issues may limit the quality of any CRS that is built on top of such data.

To understand the severity of the problem and the potential effects of data issues on the quality of a CRS, we have manually corrected the dataset by fixing the annotations and by removing noise from the utterances. Then, we conducted offline experiments and human evaluations to compare the performance of different benchmark CRS when using the original (INSPIRED) and improved (INSPIRED2) datasets. Overall, the results of our analyses indicate that all CRS showed better performance in different dimensions when built on INSPIRED2. In order to facilitate the design and development of future sociable CRS, we release the INSPIRED2 dataset online at https://github.com/ahtsham58/INSPIRED2.

¹ Consider the movie "It (2017)" as an example of a difficult case, e.g., when appearing in an utterance like "Have you seen It?".

2. Related Work

In this section, we first discuss datasets and aspects of data quality in the context of CRS. Afterwards, we review different design paradigms for building CRS, followed by a discussion of predominant evaluation approaches for such systems.

Datasets and Data Quality
Research interest in CRS has experienced substantial growth in recent years, see [18, 19] for related surveys. Many current systems interact with users in natural language, and one important goal for such systems is to enable them to engage in conversations that reflect human behavior. Since many of these recent systems are built on recorded dialogs between humans, the capabilities of the resulting CRS depend on the richness of the communication in the datasets, e.g., in terms of the user intents that can be found in the conversations, see [20] for a detailed analysis of such intents.

A number of new datasets for conversational recommendation were published in recent years, e.g., [3, 21, 22, 23]. Such datasets, which are commonly collected with the help of crowdworkers, can however have limitations and may not be fully representative of what we would observe in reality. In some cases, for example, crowdworkers were instructed to mention a minimum number of movies in the conversations. This leads to mostly "instance-based" conversations, where crowdworkers mention individual movies they like rather than their preferred genres, see also [3, 24, 25].

Another problem when creating such datasets lies in the recognition and annotation of named entities appearing in the conversations, as mentioned above. Annotating entities in textual data can be a tedious process that may require a substantial amount of manual effort and time. To overcome this challenge, researchers sometimes adopt a semi-automatic approach or rely on NLP-assisted tools that visualize the entities in a text in order to reduce the required manual effort [16, 15, 26]. Generally, automatic approaches may fail to create correct annotations in cases where human judgments and opinions are required. An automated approach was used in the context of the INSPIRED dataset: the items and entities were annotated using keyword or pattern matching approaches. However, verifying the outcomes of such automatic or semi-automatic approaches can again be laborious and require manual effort.

Today, structured annotations for items and entities mentioned in the conversations are common in recent datasets. For example, in the case of the ReDial dataset [3], the mentioned movie titles were annotated with unique IDs. However, the ReDial dataset has some limitations. Various meta-data concepts (e.g., genres, actors, or directors) were not annotated. Moreover, the recorded dialogs include limited social interactions or explanations for the recommendations that were made. The INSPIRED dataset, on the other hand, includes rich sociable conversation and explanation strategies for the recommended items, and aspects like movie genres or actors were explicitly annotated as well. A comparison of these differences can be found in [1]. The key statistics of the INSPIRED dataset are shown in Table 1.

As mentioned earlier, the INSPIRED dataset has some limitations. The keyword or pattern matching approach used for the annotations might, for example, not detect misspelled keywords or concepts in an utterance. Moreover, data anomalies such as noisy utterances or ill-formed language can deteriorate the performance of an annotation algorithm, leading to challenges for the downstream use of the dataset [15, 27, 28]. In reality, the level of noise can be substantial both in real-world applications and in purposefully created datasets. Therefore, data quality assurance is often considered an important step in NLP applications.
Table 1
Main Statistics of INSPIRED

                                            Total
  Number of dialogs (conversations)         1,001
  Average turns per dialog                  10.73
  Average tokens per utterance               7.93
  Number of human-recommender utterances   18,339
  Number of seeker utterances              17,472

Building Conversational Recommender Systems
Research on CRS has made substantial progress in terms of the underlying technical approaches. Some early commercial systems such as Advisor Suite [29], for example, relied on an entirely knowledge-based approach for the development of adaptive and personalized applications. Similarly, early critiquing-based systems were based on detailed knowledge about item features and possible critiques and had limited learning capabilities [30, 31].

Technological advancements, particularly in fields like NLP, speech recognition, and machine learning in general, led to the design of today's end-to-end learning-based CRS. In such approaches, recorded recommendation dialogs between paired humans are used to train deep neural models, see, e.g., [8, 9, 10, 12]. Given the last user utterance and the history of the ongoing dialog, these trained models are then used to generate responses in natural language. These responses can either include item recommendations, which are also computed with the help of machine learning techniques, or other types of conversational elements, e.g., greetings.

In terms of the underlying data, the DeepCRS [3] system was built on the ReDial dataset, which was created in the context of that work. Later on, systems were developed which also relied on this dataset but included additional information sources, e.g., from DBpedia or ConceptNet [32, 12], to build knowledge graphs that are then used to improve the generated utterances. A number of works also make use of pretrained language models like BERT [33] and subsequently fine-tune them using the recommendation dialogs, see, e.g., [34]. A related approach was adopted by the authors of INSPIRED, who proposed two variants of a conversational system, with and without strategy labels.

Unlike generation-based systems, retrieval-based CRS aim to retrieve and adapt suitable responses from the dataset of recorded dialogs. One main advantage of retrieval-based approaches is that the retrieved responses were genuinely made by humans and are thus usually grammatically correct and in themselves semantically meaningful [35]. Recent examples of such retrieval-based systems are RB-CRS [17] and CRB-CRS [36], which we designed and evaluated based on the ReDial dataset in our own previous work.

CRS Evaluation
Evaluating a CRS is a multi-faceted and challenging problem as it requires the consideration of various quality dimensions. An in-depth discussion of evaluation approaches for CRS can be found in [37]. As in the recommender systems literature in general, computational experiments that do not involve humans in the loop are the predominant instrument to assess the quality of a CRS. Common metrics to evaluate the quality of the recommendations include Recall, Hit Rate, or Precision [8, 21, 38]. Moreover, certain linguistic aspects such as fluency or diversity are often evaluated with offline experiments as well to assess the quality of the generated responses. Common metrics in this area include Perplexity, distinct N-grams, or the BLEU score [3, 8, 13, 22].

Given the interactive nature of CRS, offline experiments and the corresponding metrics have their limitations. Mainly, it is not always clear if the results obtained from offline experiments are representative of the user-perceived quality of the recommendations or system responses in general [35]. For example, when using metrics like the BLEU score, a system response is usually compared with one particular given ground truth. Such a comparison has limitations when used to estimate the average quality of a system's responses, because there might be many different alternative responses that would be suitable as well in an ongoing dialog. Still, offline evaluations have their place and value. They can, for example, be informative for assessing particular aspects such as the number of items or entities that appear in an utterance or conversation.

Overall, given the limitations of pure offline experiments, researchers often follow a mixed approach where some aspects of the system are evaluated offline and some with humans. Typical quality aspects in terms of human perceptions in such combined approaches include the assessment of the meaningfulness or consistency of the system responses [1, 8, 12, 13, 36].

3. Data Annotation Methodology

During the creation of the INSPIRED [1] dataset, items and other entities were annotated in an automated way, as described above. For example, genre keywords were annotated using a regular expression to match a set of predefined tokens. Regarding actors, directors, and other entities, a pattern matching technique was used, where words starting with a capital letter were searched in the TMDB database². A similar technique was used for movie titles. However, as mentioned, we observed a large number of cases where items and entities were either wrongly annotated or not annotated. To answer our research question on the impact of the quality of the underlying data on the quality of the responses of a CRS, we fixed the annotations as follows.

² https://www.themoviedb.org
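To make the described annotation strategy concrete, the following is a minimal, illustrative Python sketch of keyword-based genre matching combined with capitalized-token lookup. It is not the authors' original script: the genre list, the KNOWN_TITLES stand-in for a TMDB lookup, and the 1-based placeholder indexing are assumptions for illustration only.

```python
import re

# Illustrative genre keywords (assumption; the authors used a curated set of 27).
GENRES = ["comedy", "horror", "thriller", "drama", "romance", "documentary"]
GENRE_RE = re.compile(r"\b(" + "|".join(GENRES) + r")\b", re.IGNORECASE)

# Hypothetical stand-in for looking up capitalized candidates in a movie database.
KNOWN_TITLES = {"Hustlers", "Groundhog Day", "Borat"}

def annotate(utterance: str) -> str:
    """Replace matched genre keywords and known titles with indexed placeholders."""
    counters = {"MOVIE_GENRE": 0, "MOVIE_TITLE": 0}

    def placeholder(kind: str) -> str:
        counters[kind] += 1
        return f"[{kind}_{counters[kind]}]"

    # 1) Keyword/regex matching for genres.
    utterance = GENRE_RE.sub(lambda _: placeholder("MOVIE_GENRE"), utterance)

    # 2) Pattern matching: capitalized word sequences checked against the title set.
    for candidate in re.findall(r"[A-Z][\w']*(?:\s+[A-Z][\w']*)*", utterance):
        if candidate in KNOWN_TITLES:
            utterance = utterance.replace(candidate, placeholder("MOVIE_TITLE"), 1)

    return utterance

print(annotate("My favorite comedy movies are Groundhog Day and Borat"))
# -> "My favorite [MOVIE_GENRE_1] movies are [MOVIE_TITLE_1] and [MOVIE_TITLE_2]"
```

As the sketch makes obvious, such exact matching breaks down as soon as a title is misspelled or abbreviated, which is precisely the kind of failure discussed in the remainder of this section.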
Procedure
To fix the annotations, we interviewed a number of university students to assess their knowledge of the movies domain and their ability to do the correction task. Subsequently, we hired two students and instructed them on how to annotate and clean the dataset. First, they were briefed on the logical format of the original annotations and how to retain that format. Second, they were asked to read each utterance individually, to detect potential noise, and to analyze which items or entities (e.g., title, genre, actor, or director) are mentioned in it.

In case of ambiguity or obscurity, they were allowed to access online portals, e.g., IMDb³. Note that regarding the genres, a set of 27 keywords was provided to them, which we curated and used in our earlier research [36]. After the briefing, the dataset was split evenly between the two annotators. On a weekly basis, their performance and the accuracy of the annotations were checked by one of the authors. Finally, after annotating the complete dataset, a number of additional validation steps were applied.

First, using a Python script, we ensured that every placeholder is enclosed by '[' and ']' as was done originally, e.g., [MOVIE_TITLE_1]. Second, another thorough manual examination of the entire improved dataset was performed to fix any missing annotations or noise. In that context, we also double-checked the consistency of the format and of the annotations.

³ https://www.imdb.com/
The INSPIRED2 Dataset
In total, 1,851 new annotations were added to INSPIRED, leading to the INSPIRED2 dataset. Most mistakes or inconsistencies were found for the items, i.e., movie titles, which are the most pertinent information for developing a CRS. We present the statistics about the new annotations in Table 2. Overall, we added around 20% new annotations in INSPIRED2. The number of issues that were fixed, e.g., duplicate annotations in an utterance, noise, or factually wrong information in the original annotations, is not shown in the presented statistics. We release INSPIRED2 both in TSV and JSON format online.

Table 2
Statistics about new annotations added in INSPIRED2

                                       Total    % Increase
  Number of movie titles                 966          22.0
  Number of movie genres                 206           5.0
  Number of actors, directors, etc.      519          49.0
  Number of movie plots                  160          54.6
  Number of new annotations            1,851          18.9

Observed Issues
During the annotation process, we recorded the observed issues in the original annotations. Since the original annotations were created using automatic techniques, many issues were related to the limitations of the simple keyword or pattern matching techniques. Overall, we observed a number of cases where minor spelling mistakes or incomplete movie titles made the exact string matching approaches ineffective.

For example, in one of the utterances, "I think I am waiting for Star Wars The Rise of Skywalker", the annotation was missing because the correct title is "Star Wars: Episode IX – The Rise of Skywalker". Similarly, we observed a significant number of cases where an utterance was only partially annotated, e.g., "ok is it scary like incidious or [MOVIE_GENRE_2] [MOVIE_TITLE_5]". In addition, in places where two entities were separated by '/' instead of a space, the automatic technique often failed to create proper annotations, e.g., "Since you like [MOVIE_GENRE_1] drama/mystery, I'm going to send you the trailer to the movie [MOVIE_TITLE_3]".

Also, the automatic approach used for INSPIRED sometimes had difficulties dealing with ambiguity. We found a number of cases where a regular word was annotated although it did not belong to any item or entity. For example, in one of the cases, "Are you interested in a current movie in the box office?", the utterance was annotated as "Are you interested in a current movie in the box [MOVIE_TITLE_0]", where the word 'office' was mistakenly annotated as an item, i.e., The Office (2005).

Overall, the main observed issues are the following.

   1. Missing annotations for movie titles, genres, actors, movie plots, etc.
   2. Partially annotated items and entities such as movie titles or genres in an utterance.
   3. Factually wrong annotations for movie titles.
   4. Inconsistent indexing for the annotated items and entities.
   5. Mistaken annotations for plain text, e.g., family, box office; human annotations may be required here.
   6. Parts of the utterance or a few keywords that were omitted during the annotation process.

4. Evaluation Methodology

We performed both offline experiments as well as a human evaluation to assess the impact of data quality on the quality of the responses of a CRS.
Offline Evaluation of Recommendation Quality
We included the following recent end-to-end learning approaches in our experiments: DeepCRS [3], KGSF [12], TG-ReDial [22], and the INSPIRED model without strategy labels⁴ [1]. This selection of models covers various design approaches for CRS, e.g., using an additional knowledge graph or not. We used the open-source toolkit CRSLab⁵ for our evaluations. This framework was used in earlier research as well, for example in [10, 39, 40]. For our analyses, we first trained the aforementioned CRS models using the original split ratio, i.e., 8:1:1, for each dataset. Afterwards, given the trained models and the test data for each dataset, we ran three trials for each CRS and subsequently averaged the results for the offline evaluation metrics. Note that the same procedure was applied to both versions of the dataset, i.e., INSPIRED and INSPIRED2.

⁴ The INSPIRED model with strategy labels was not publicly available.
⁵ https://github.com/RUCAIBox/CRSLab
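For reference, the ranking metrics reported later in Table 3 (Hit@k, MRR@k, NDCG@k) can be computed per test turn as in the following minimal sketch; this is a generic illustration with a single ground-truth item per turn, not the CRSLab implementation.

```python
import math

def hit_at_k(ranked_items, target, k):
    """1 if the ground-truth item appears in the top-k recommendations, else 0."""
    return int(target in ranked_items[:k])

def mrr_at_k(ranked_items, target, k):
    """Reciprocal rank of the ground-truth item within the top-k, else 0."""
    for rank, item in enumerate(ranked_items[:k], start=1):
        if item == target:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranked_items, target, k):
    """NDCG with a single relevant item: 1/log2(rank+1) if it is in the top-k."""
    for rank, item in enumerate(ranked_items[:k], start=1):
        if item == target:
            return 1.0 / math.log2(rank + 1)
    return 0.0

# Example: one recommendation list and one ground-truth item; per-turn scores are
# averaged over the test set (and, in our setup, over the three trials).
ranked = ["m_42", "m_7", "m_13"]
print(hit_at_k(ranked, "m_7", 10), mrr_at_k(ranked, "m_7", 10), ndcg_at_k(ranked, "m_7", 10))
# -> 1 0.5 0.6309297535714575
```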
                                                                        CRS that are guided by a topic policy, and (iv) CRS that
User Study on Linguistic Quality
We conducted a user study to compare the perceived quality of system responses when using either INSPIRED or INSPIRED2. Specifically, we randomly sampled the same 50 dialog situations from each dataset. To create the dialog continuations, we used the retrieval-based CRS approaches RB-CRS and CRB-CRS, which we proposed in our earlier work, see [36].

In order to obtain fine-grained assessments, three human judges⁶ were involved. The specific task of the judges was to assess (rate) the meaningfulness of a system response as a proxy of its quality and consistency in a dialog situation, see [3, 41, 12]. Note that in this study we did not explicitly assess the quality of the specific item recommendations. Instead, the focus of this study was to understand the impact of the improved underlying dataset on the linguistic quality and the consistency of the generated responses.

We used a 3-point scale for these ratings, from 'Completely meaningless (1)' to 'Somewhat meaningless and meaningful (2)' to 'Completely meaningful (3)'. The human judges were provided with specific instructions on how to evaluate the meaningfulness of a response, e.g., they should assess if a response represents a logical dialog continuation and evaluate the overall language quality of the given response. Overall, the human judges were provided with 50 dialogs (446 responses to rate) that were produced using the INSPIRED and INSPIRED2 datasets. We also explained the meaning and purpose of the various placeholders contained in the responses to the human judges. Moreover, to avoid any bias in the evaluation process, the judges were not made aware of which response was created for which dataset or by which CRS. Also, the order of the dialogs and the system responses was randomized.

⁶ These judges were PhD students and were different from the ones who fixed the annotations.

5. Results

Recommendation Quality
Table 3 shows the accuracy results for the evaluated CRS models. Specifically, we provide the results for the different benchmark CRS models in terms of the performance difference when using the original and improved annotations. Overall, we can observe an almost consistent gain in performance for all models and on all metrics except Hit@50 when the improved dataset is used. The obtained improvements can be quite substantial, indicating that improved data quality can be helpful for CRS of different types, including (i) CRS that do not rely on additional knowledge sources, (ii) CRS that leverage additional knowledge sources, (iii) CRS that are guided by a topic policy, and (iv) CRS that rely on pre-trained language models like BERT.

Interestingly, we see negative effects for two measurements in which Hit@50 is used as a metric. A deeper investigation of this phenomenon is needed, in particular as the other metrics at this (admittedly rather uncommon) list length, MRR@50 and NDCG@50, indicate that the improved dataset is helpful to increase recommendation accuracy. At the moment, we can only speculate that the improved annotations in the ongoing dialog histories led to more diverse or niche recommendations compared to the original dataset. We might assume that the missing annotations in many cases referred to less popular movies, so that the recommendations without the improved annotations more often include popular movies, which is commonly advantageous in terms of hit rate and recall.
Table 3
Accuracy results obtained in the offline evaluation. V1 represents INSPIRED, V2 denotes INSPIRED2, and "% Change" represents the actual performance gain/loss when using INSPIRED2 compared to INSPIRED.

                          Hit@1    Hit@10   Hit@50    MRR@1   MRR@10   MRR@50   NDCG@1  NDCG@10  NDCG@50
 DeepCRS    V1           0.0006   0.0464   0.1726   0.0065   0.0148   0.0193   0.0065   0.0220   0.0478
 [3]        V2           0.0256   0.0578   0.1222   0.0256   0.0306   0.0333   0.0256   0.0366   0.0504
            % Change    4161.11    24.50   -29.20   294.95   106.99    72.48   294.95    66.08     5.31
 KGSF       V1           0.0022   0.0216   0.0744   0.0032   0.0061   0.0084   0.0022   0.0097   0.0211
 [12]       V2           0.0066   0.0303   0.0587   0.0057   0.0123   0.0134   0.0066   0.0165   0.0223
            % Change     207.27    40.46   -21.11    75.58   100.97    58.40   207.27    70.36     5.78
 TG-ReDial  V1           0.0365   0.1149   0.2344   0.0365   0.0572   0.0626   0.0365   0.0707   0.0967
 [22]       V2           0.0511   0.1315   0.2417   0.0511   0.0742   0.0792   0.0511   0.0877   0.1118
            % Change      40.00    14.46     3.12    40.00    29.64    26.48    40.00    24.05    15.51
 INSPIRED   V1           0.0151   0.0550   0.1532   0.0151   0.0241   0.0286   0.0151   0.0312   0.0527
 [1]        V2           0.0194   0.0734   0.1855   0.0194   0.0293   0.0353   0.0194   0.0392   0.0650
            % Change      28.57    33.33    21.13    28.57    21.59    23.44    28.57    25.44    23.28
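The "% Change" column can be read as the relative difference between the two scores; a small sketch of this computation is given below (recomputing from the rounded table entries may differ in the last digits from the reported values, which were presumably derived from unrounded scores).

```python
def percent_change(v1: float, v2: float) -> float:
    """Relative gain/loss of the V2 score over the V1 score, in percent."""
    return (v2 - v1) / v1 * 100

# Example: TG-ReDial, Hit@10 (Table 3).
print(round(percent_change(0.1149, 0.1315), 2))  # -> 14.45 (reported: 14.46)
```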




Linguistic Quality
We recall that three human evaluators assessed the linguistic quality of the system responses (dialog continuations), which were created based on either the INSPIRED or the INSPIRED2 dataset. As underlying CRS systems, we considered the retrieval-based approaches RB-CRS and CRB-CRS, as mentioned above. For our analysis, we averaged the scores of the three evaluators. Table 4 shows the mean ratings across all dialog situations as well as the standard deviations. We find that also in the case of retrieval-based approaches, improving the quality of the underlying dataset was helpful, leading to higher mean scores without larger standard deviations. A Student's t-test reveals that the observed differences in the means are statistically significant (p<0.001).⁷

⁷ We provide the data and compiled results of our study online.

Table 4
Results of Human Evaluation

                              INSPIRED    INSPIRED2
  RB-CRS    Average score         2.30         2.46
            Std. deviation        0.62         0.59
  CRB-CRS   Average score         2.31         2.46
            Std. deviation        0.55         0.55
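A minimal sketch of the kind of significance test reported above is shown next. It assumes the per-response ratings for the two dataset versions are available as two arrays and uses an independent two-sample Student's t-test via SciPy; the sample values are placeholders, and whether the authors used a paired or unpaired variant is not specified in the paper.

```python
from scipy import stats

# Hypothetical per-response meaningfulness ratings (averaged over the three judges),
# one list per dataset version; in the study there were 446 rated responses in total.
ratings_inspired = [2.3, 2.0, 2.7, 1.7, 2.3]    # placeholder values
ratings_inspired2 = [2.7, 2.3, 3.0, 2.0, 2.7]   # placeholder values

# Independent two-sample Student's t-test (equal variances assumed).
t_stat, p_value = stats.ttest_ind(ratings_inspired, ratings_inspired2)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
```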

Comparison of Knowledge Concepts in Responses
To understand the impact of the new annotations on the responses in terms of the richness of knowledge concepts, we computed the number of items and entities that appeared in the system responses. Specifically, we counted the number of placeholders in the responses before they would be replaced by the recommendation component, see [36]. In Table 5, we present the statistics for RB-CRS and CRB-CRS for both dataset versions. Overall, we find that the responses for the improved dataset contain between 20% and 27% more concepts and entities. We note that an increase in concepts is expected, as INSPIRED2 has almost 20% more annotations. However, the important observation here is that the retrieval-based CRS approaches actually surfaced these richer system responses frequently.

Table 5
Number of Items and Entities included in Responses

              INSPIRED    INSPIRED2    % Increase
  RB-CRS           174          222          27.6
  CRB-CRS          208          251          20.7
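A minimal sketch of this placeholder counting is given below; the placeholder pattern mirrors the examples shown earlier, and the sample responses are invented for illustration.

```python
import re

# Matches indexed placeholders such as [MOVIE_TITLE_3] or [MOVIE_GENRE_1].
PLACEHOLDER = re.compile(r"\[[A-Z_]+_\d+\]")

def count_concepts(responses):
    """Total number of item/entity placeholders across a list of system responses."""
    return sum(len(PLACEHOLDER.findall(r)) for r in responses)

responses_v1 = ["I recommend [MOVIE_TITLE_1].", "Do you like [MOVIE_GENRE_1]?"]
responses_v2 = ["I recommend [MOVIE_TITLE_1] or [MOVIE_TITLE_2].", "Do you like [MOVIE_GENRE_1]?"]
print(count_concepts(responses_v1), count_concepts(responses_v2))  # -> 2 3
```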
                                                                  across different technical approaches for building CRS.
BLEU Score Analysis
Finally, in order to understand to what extent (offline) linguistic scores correlate with the perceived quality of responses, as was done in [1, 8], we performed an analysis of the BLEU scores obtained for the different datasets. Specifically, given a system response and the corresponding ground-truth response, we preprocess both sentences and compute the BLEU scores for N = {1, 2, 3, 4} grams. We provide the results of this analysis online. In sum, the analysis shows that the BLEU scores generally improve when the underlying data quality is higher, i.e., in the case of the INSPIRED2 dataset. These findings are thus well aligned with the outcomes of our human evaluation study, where using INSPIRED2 as an underlying dataset turned out to be favorable.
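A minimal sketch of such a BLEU computation with NLTK is shown below. The whitespace tokenization and lowercasing stand in for the preprocessing step, which the paper does not detail, and the per-order weighting (one score per n-gram order) is one common reading of "BLEU for N = {1, 2, 3, 4} grams".

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def bleu_n(reference: str, hypothesis: str, n: int) -> float:
    """BLEU score considering only n-grams of a single order n."""
    ref_tokens = reference.lower().split()
    hyp_tokens = hypothesis.lower().split()
    weights = tuple(1.0 if i == n - 1 else 0.0 for i in range(4))
    return sentence_bleu([ref_tokens], hyp_tokens, weights=weights,
                         smoothing_function=SmoothingFunction().method1)

ground_truth = "I think you would enjoy that movie"
system_response = "I think you will enjoy this movie"
for n in (1, 2, 3, 4):
    print(f"BLEU-{n}: {bleu_n(ground_truth, system_response, n):.3f}")
```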



6. Conclusion

Datasets containing recorded dialogs between humans are the basis for many modern CRS. In this work, we have analyzed the recent INSPIRED dataset, which was developed to build the next generation of sociable CRS. We found that automatic entity and concept labeling has its limitations, and we have improved the quality of the dataset through a manual process. We then conducted both computational experiments and experiments with users to analyze to what extent improved data quality impacts recommendation accuracy and the quality perception of the system's responses by users. The analyses clearly indicate the benefits of improved data quality across different technical approaches for building CRS. We release the improved dataset publicly and hope to thereby stimulate more research on sociable conversational recommender systems in the future.
References

 [1] S. A. Hayati, D. Kang, Q. Zhu, W. Shi, Z. Yu, INSPIRED: Toward sociable recommendation dialog systems, in: EMNLP '20, 2020.
 [2] F. Pecune, L. Callebert, S. Marsella, A socially-aware conversational recommender system for personalized recipe recommendations, in: Proceedings of the 8th International Conference on Human-Agent Interaction, HAI '20, 2020, pp. 78–86.
 [3] R. Li, S. E. Kahou, H. Schulz, V. Michalski, L. Charlin, C. Pal, Towards deep conversational recommendations, in: NIPS '18, 2018, pp. 9725–9735.
 [4] D. Jannach, L. Chen, Conversational recommendation: A grand AI challenge, AI Magazine 43 (2022).
 [5] M. Di Bratto, M. Di Maro, A. Origlia, F. Cutugno, Dialogue analysis with graph databases: Characterising domain items usage for movie recommendations (2021).
 [6] C.-M. Wong, F. Feng, W. Zhang, C.-M. Vong, H. Chen, Y. Zhang, P. He, H. Chen, K. Zhao, H. Chen, Improving conversational recommender system by pretraining billion-scale knowledge graph, in: ICDE '21, 2021, pp. 2607–2612.
 [7] Y. Cao, X. Wang, X. He, Z. Hu, T.-S. Chua, Unifying knowledge graph learning and recommendation: Towards a better understanding of user preferences, in: WWW '19, 2019, pp. 151–161.
 [8] Q. Chen, J. Lin, Y. Zhang, M. Ding, Y. Cen, H. Yang, J. Tang, Towards knowledge-based recommender dialog system, in: EMNLP-IJCNLP '19, 2019, pp. 1803–1813.
 [9] J. Zhou, B. Wang, R. He, Y. Hou, CRFR: Improving conversational recommender systems via flexible fragments reasoning on knowledge graphs, in: EMNLP '21, 2021, pp. 4324–4334.
[10] K. Chen, S. Sun, Knowledge-based conversational recommender systems enhanced by dialogue policy learning, in: IJCKG '21, 2021, pp. 10–18.
[11] A. Wang, C. D. V. Hoang, M.-Y. Kan, Perspectives on crowdsourcing annotations for natural language processing, Language Resources and Evaluation 47 (2013) 9–31.
[12] K. Zhou, W. X. Zhao, S. Bian, Y. Zhou, J.-R. Wen, J. Yu, Improving conversational recommender systems via knowledge graph based semantic fusion, in: KDD '20, 2020, pp. 1006–1014.
[13] Y. He, L. Liao, Z. Zhang, T.-S. Chua, Towards enriching responses with crowd-sourced knowledge for task-oriented dialogue, in: MuCAI '21, 2021, pp. 3–11.
[14] T. Arjannikov, C. Sanden, J. Z. Zhang, Verifying tag annotations through association analysis, in: ISMIR '13, 2013, pp. 195–200.
[15] J. S. Grosman, P. H. Furtado, A. M. Rodrigues, G. G. Schardong, S. D. Barbosa, H. C. Lopes, ERAS: Improving the quality control in the annotation process for natural language processing tasks, Information Systems 93 (2020) 101553.
[16] B. C. Benato, J. F. Gomes, A. C. Telea, A. X. Falcão, Semi-automatic data annotation guided by feature space projection, Pattern Recognition 109 (2021) 107612.
[17] A. Manzoor, D. Jannach, Generation-based vs. retrieval-based conversational recommendation: A user-centric comparison, in: RecSys '21, 2021.
[18] D. Jannach, A. Manzoor, W. Cai, L. Chen, A survey on conversational recommender systems, ACM Computing Surveys 54 (2021) 1–36.
[19] C. Gao, W. Lei, X. He, M. de Rijke, T.-S. Chua, Advances and challenges in conversational recommender systems: A survey, AI Open 2 (2021) 100–126.
[20] W. Cai, L. Chen, Predicting user intents and satisfaction with dialogue-based conversational recommendations, in: UMAP '20, 2020, pp. 33–42.
[21] D. Kang, A. Balakrishnan, P. Shah, P. Crook, Y.-L. Boureau, J. Weston, Recommendation as a communication game: Self-supervised bot-play for goal-oriented dialogue, in: EMNLP-IJCNLP '19, 2019, pp. 1951–1961.
[22] K. Zhou, Y. Zhou, W. X. Zhao, X. Wang, J.-R. Wen, Towards topic-guided conversational recommender system, in: ICCL '20, 2020, pp. 4128–4139.
[23] Z. Fu, Y. Xian, Y. Zhu, Y. Zhang, G. de Melo, COOKIE: A dataset for conversational recommendation over knowledge graphs in e-commerce, 2020. arXiv:2008.09237.
[24] K. Christakopoulou, F. Radlinski, K. Hofmann, Towards conversational recommender systems, in: KDD '16, 2016, pp. 815–824.
[25] X. Ren, H. Yin, T. Chen, H. Wang, Z. Huang, K. Zheng, Learning to ask appropriate questions in conversational recommendation, in: SIGIR '21, 2021, pp. 808–817.
[26] P. Stenetorp, S. Pyysalo, G. Topić, T. Ohta, S. Ananiadou, J. Tsujii, brat: A web-based tool for NLP-assisted text annotation, in: ACL '12, 2012, pp. 102–107.
[27] C. Zong, R. Xia, J. Zhang, Data Annotation and Preprocessing, 2021, pp. 15–31.
[28] P. Röttger, B. Vidgen, D. Hovy, J. B. Pierrehumbert, Two contrasting data annotation paradigms for subjective NLP tasks, 2021. arXiv:2112.07475.
[29] D. Jannach, ADVISOR SUITE – A knowledge-based sales advisory system, in: ECAI '04, 2004, pp. 720–724.
[30] K. McCarthy, Y. Salem, B. Smyth, Experience-based critiquing: Reusing critiquing experiences to improve conversational recommendation, in: ICCBR '10, 2010, pp. 480–494.
[31] L. Chen, P. Pu, Critiquing-based recommenders: Survey and emerging trends, User Modeling and User-Adapted Interaction 22 (2012) 125–150.
[32] Q. Chen, J. Lin, Y. Zhang, H. Yang, J. Zhou, J. Tang, Towards knowledge-based personalized product description generation in e-commerce, in: KDD '19, 2019, pp. 3040–3050.
[33] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in: NAACL-HLT, 2019.
[34] L. Wang, H. Hu, L. Sha, C. Xu, K.-F. Wong, D. Jiang, Finetuning large-scale pre-trained language models for conversational recommendation with knowledge graph, 2021. arXiv:2110.07477.
[35] A. Manzoor, D. Jannach, Conversational recommendation based on end-to-end learning: How far are we?, Computers in Human Behavior Reports (2021) 100139.
[36] A. Manzoor, D. Jannach, Towards retrieval-based conversational recommendation, Information Systems (2022) 102083.
[37] D. Jannach, Evaluating conversational recommender systems, Artificial Intelligence Review, forthcoming (2022).
[38] T. Zhang, Y. Liu, P. Zhong, C. Zhang, H. Wang, C. Miao, KECRS: Towards knowledge-enriched conversational recommendation system, 2021. arXiv:2105.08261.
[39] Y. Zhou, K. Zhou, W. X. Zhao, C. Wang, P. Jiang, H. Hu, C²-CRS: Coarse-to-fine contrastive learning for conversational recommender system, in: WSDM '22, 2022, pp. 1488–1496.
[40] Y. Li, B. Peng, Y. Shen, Y. Mao, L. Liden, Z. Yu, J. Gao, Knowledge-grounded dialogue generation with a unified knowledge representation, 2021. arXiv:2112.07924.
[41] D. Jannach, A. Manzoor, End-to-end learning for conversational recommendation: A long way to go?, in: IntRS Workshop at RecSys '20, Online, 2020.