INSPIRED2: An Improved Dataset for Sociable Conversational Recommendation

Ahtsham Manzoor∗, Dietmar Jannach
University of Klagenfurt, Universitätsstraße 65-67, Klagenfurt am Wörthersee, 9020, Austria

4th Edition of Knowledge-aware and Conversational Recommender Systems (KaRS) Workshop @ RecSys 2022, September 18–23, 2022, Seattle, WA, USA.
∗ Corresponding author.
Email: ahtsham.manzoor@aau.at (A. Manzoor); dietmar.jannach@aau.at (D. Jannach)
Web: https://ahtsham58.github.io/ (A. Manzoor); https://www.aau.at/en/aics/research-groups/infsys/team/dietmar-jannach/ (D. Jannach)
ORCID: 0000-0001-9418-753 (A. Manzoor); 0000-0002-4698-8507 (D. Jannach)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.

Abstract

Conversational recommender systems (CRS) that are able to interact with users in natural language often utilize recommendation dialogs which were previously collected with the help of paired humans, where one plays the role of a seeker and the other the role of a recommender. These recommendation dialogs include items and entities that indicate the users' preferences. In order to precisely model the seekers' preferences and respond consistently, CRS typically rely on item and entity annotations. A recent example of such a dataset is INSPIRED, which consists of recommendation dialogs for sociable conversational recommendation, where items and entities were annotated using automatic keyword or pattern matching techniques. An analysis of this dataset unfortunately revealed a substantial number of cases where items and entities were either wrongly annotated or annotations were missing altogether. This raises the question to what extent automatic annotation techniques are effective. Moreover, it is important to study the impact of annotation quality on the overall effectiveness of a CRS in terms of the quality of the system's responses. To study these aspects, we manually fixed the annotations in INSPIRED. We then evaluated the performance of several benchmark CRS using both versions of the dataset. Our analyses suggest that the improved version of the dataset, i.e., INSPIRED2, helped increase the performance of several benchmark CRS, emphasizing the importance of data quality both for end-to-end learning and retrieval-based approaches to conversational recommendation. We release our improved dataset (INSPIRED2) publicly at https://github.com/ahtsham58/INSPIRED2.

Keywords: Conversational Recommender Systems, data quality, annotations, evaluation, dialog systems

1. Introduction

Sociable conversational recommender systems (CRS) aim to build rapport with users while interacting with them in natural language [1, 2]. CRS that rely on natural language processing (NLP) nowadays commonly utilize datasets of previously recorded dialogs between humans, where one plays the role of a recommendation seeker and the other that of a human recommender, see e.g., [3]. However, due to a certain lack of rich sociable interactions in such datasets [4], it can be challenging to build a sociable CRS that establishes rapport with users on the basis of such limited data. Therefore, it is important to develop datasets like INSPIRED [1], which includes dialogs that implement rich social communication strategies.
Such rich datasets represent a solid basis to develop trustworthy CRS that are able to engage users in a natural and user-adaptive manner. Another key factor for building high-quality CRS lies in the proper recognition of the named entities and other concepts that appear in the dialogs. In the movies domain, for example, being able to exactly identify the items (i.e., movies) and related entities and concepts (e.g., actors or genres) can play a pivotal role for building an effective system. Existing CRS, for example, arrange such entities and their relationships as graphs [5, 6], and these relationships often form the basis to model the users' preferences, e.g., [7, 8, 9]. Moreover, domain-specific concepts and entities can also contribute to the generation of meaningful and coherent responses, especially in knowledge-aware CRS, see [10, 11, 12, 13].

Annotating items and entities can be a laborious and economically expensive process [14, 15]. Human costs are high and may even be prohibitive for domains where particular knowledge or expertise is required to accomplish the annotation task [16]. In that context, the quality of the resulting annotations is crucial, and factually wrong annotations can lead to errors or ambiguity in the downstream task. Automating the annotation task, or at least automatically verifying the annotations [14], has therefore been a focus of research for several years. We note here that data quality is crucial both for recent generation-based CRS approaches and for retrieval-based approaches to building natural language conversational systems [17]. For both types of systems, the question arises to what extent better data quality, i.e., having correct annotations and noise-free conversations, leads to better results in terms of the quality of the responses returned by a system for a given user utterance, e.g., in terms of consistency and plausibility.

In this work, we study the recent INSPIRED dataset, in which the items and entities that were mentioned in the recorded utterances are explicitly annotated. These annotations were created with the help of automatic approaches using keyword or pattern matching methods. However, looking at the data, we observed a substantial number of cases where items and entities were either wrongly annotated or annotations were missing entirely, e.g., "My favorite [MOVIE_GENRE_1] are Groundhogs Day, [MOVIE_TITLE_2] and Borat". In addition, there were several cases where the utterances included noise, e.g., "How did you like QUOTATION_MARKHustlersQUOTATION_MARK?". Finally, we found instances where regular words were identified as being named entities. In this latter case, human annotations would in fact have been required; consider the movie "It (2017)" as an example of a difficult case, e.g., when appearing in an utterance like "Have you seen It?". Overall, such issues may limit the quality of any CRS that is built on top of such data.
To understand the severity of the problem and the potential effects of data issues on the quality of a CRS, we manually corrected the dataset by fixing the annotations and by removing noise from the utterances. Then, we conducted offline experiments and human evaluations to compare the performance of different benchmark CRS when using the original (INSPIRED) and the improved (INSPIRED2) dataset. Overall, the results of our analyses indicate that all CRS showed better performance in different dimensions when built on INSPIRED2. In order to facilitate the design and development of future sociable CRS, we release the INSPIRED2 dataset online at https://github.com/ahtsham58/INSPIRED2.

2. Related Work

In this section, we first discuss datasets and aspects of data quality in the context of CRS. Afterwards, we review different design paradigms for building CRS, followed by a discussion of predominant evaluation approaches for such systems.

Datasets and Data Quality
Research interest in CRS has experienced substantial growth in recent years, see [18, 19] for related surveys. Many current systems interact with users in natural language, and one important goal for such systems is to enable them to engage in conversations that reflect human behavior. Since many of these recent systems are built on recorded dialogs between humans, the capabilities of the resulting CRS depend on the richness of the communication in the datasets, e.g., in terms of the user intents that can be found in the conversations, see [20] for a detailed analysis of such intents.

A number of new datasets for conversational recommendation were published in recent years, e.g., [3, 21, 22, 23]. Such datasets, which are commonly collected with the help of crowdworkers, can however have limitations and may not be fully representative of what we would observe in reality. In some cases, for example, crowdworkers were instructed to mention a minimum number of movies in the conversations. This leads to mostly "instance-based" conversations, where crowdworkers rather mention individual movies they like than their preferred genres, see also [3, 24, 25].

Another problem when creating such datasets lies in the recognition and annotation of named entities appearing in the conversations, as mentioned above. Annotating entities in textual data can be a tedious process that may require a substantial amount of manual effort and time. To overcome this challenge, researchers sometimes adopt a semi-automatic approach or rely on NLP-assisted tools that visualize the entities in a text in order to reduce the required manual effort [16, 15, 26]. Generally, some automatic approaches may fail to create correct annotations in cases where human judgment and opinions are required. An automated approach was used in the context of the INSPIRED dataset. Here, the items and entities were annotated using keyword or pattern matching approaches. However, verifying the outcomes of such automatic or semi-automatic approaches can again be laborious and require manual effort.

Today, structured annotations for items and entities mentioned in the conversations are common in recent datasets. For example, in the case of the ReDial dataset [3], the mentioned movie titles were annotated with unique IDs. However, the ReDial dataset has some limitations. Various meta-data concepts (e.g., genres, actors, or directors) were not annotated. Moreover, the recorded dialogs include limited social interactions or explanations for the recommendations made. The INSPIRED dataset, on the other hand, includes rich sociable conversation and explanation strategies for the recommended items. Also, aspects like movie genres or actors were explicitly annotated. A comparison of these differences can be found in [1]. The key statistics of the INSPIRED dataset are shown in Table 1.
As mentioned earlier, the INSPIRED dataset has some limitations. The keyword or pattern matching approach used for the annotations might, for example, not detect misspelled keywords or concepts in an utterance. Moreover, data anomalies such as noisy utterances or ill-formed language can deteriorate the performance of an annotation algorithm, leading to challenges for the downstream use of the dataset [15, 27, 28]. In reality, the level of noise can be substantial both in real-world applications and in purposefully created datasets. Therefore, data quality assurance is often considered a significant and important step in NLP applications.

Table 1
Main Statistics of INSPIRED

                                           Total
Number of dialogs (conversations)          1,001
Average turns per dialog                   10.73
Average tokens per utterance                7.93
Number of human-recommender utterances    18,339
Number of seeker utterances               17,472

Building Conversational Recommender Systems
Research on CRS has made substantial progress in terms of the underlying technical approaches. Some early commercial systems such as Advisor Suite [29], for example, relied on an entirely knowledge-based approach for the development of adaptive and personalized applications. Similarly, early critiquing-based systems were based on detailed knowledge about item features and possible critiques and had limited learning capabilities [30, 31].

Technological advancements, particularly in fields like NLP, speech recognition, and machine learning in general, led to the design of today's end-to-end learning-based CRS. In such approaches, recorded recommendation dialogs between paired humans are used to train deep neural models, see, e.g., [8, 9, 10, 12]. Given the last user utterance and the history of the ongoing dialog, these trained models are then used to generate responses in natural language.
These responses can either include item recommendations, which are also computed with the help of machine learning techniques, or other types of conversational elements, e.g., greetings.

In terms of the underlying data, the DeepCRS [3] system was built on the ReDial dataset, which was created in the context of that work. Later on, systems were developed which also relied on this dataset but included additional information sources, e.g., from DBpedia or ConceptNet [32, 12], to build knowledge graphs that are then used to improve the generated utterances. A number of works also make use of pretrained language models like BERT [33] and subsequently fine-tune them on the recommendation dialogs, see, e.g., [34]. A related approach was adopted by the authors of INSPIRED, who proposed two variants of a conversational system, with and without strategy labels.

Unlike generation-based systems, retrieval-based CRS aim to retrieve and adapt suitable responses from the dataset of recorded dialogs. One main advantage of retrieval-based approaches is that the retrieved responses were genuinely made by humans and are thus usually grammatically correct and in themselves semantically meaningful [35]. Recent examples of such retrieval-based systems are RB-CRS [17] and CRB-CRS [36], which we designed and evaluated based on the ReDial dataset in our own previous work.

CRS Evaluation
Evaluating a CRS is a multi-faceted and challenging problem, as it requires the consideration of various quality dimensions. An in-depth discussion of evaluation approaches for CRS can be found in [37]. Like in the recommender systems literature in general, computational experiments that do not involve humans in the loop are the predominant instrument to assess the quality of a CRS. Common metrics to evaluate the quality of the recommendations include Recall, Hit Rate, or Precision [8, 21, 38]. Moreover, certain linguistic aspects such as fluency or diversity are often evaluated with offline experiments as well to assess the quality of the generated responses. Common metrics in this area include Perplexity, distinct n-grams, or the BLEU score [3, 8, 13, 22].

Given the interactive nature of CRS, offline experiments and the corresponding metrics have their limitations. Mainly, it is not always clear if the results obtained from offline experiments are representative of the user-perceived quality of the recommendations or system responses in general [35]. For example, when using metrics like the BLEU score, a system response is usually compared with one particular given ground truth. Such a comparison has limitations when used to estimate the average quality of a system's responses, because many different alternative responses might be suitable as well in an ongoing dialog. Still, offline evaluations have their place and value. They can for example be informative for assessing particular aspects such as the number of items or entities that appear in an utterance or conversation.

Overall, given the limitations of pure offline experiments, researchers often follow a mixed approach where some aspects of the system are evaluated offline and some with humans. Typical quality aspects in terms of human perceptions in such combined approaches include the assessment of the meaningfulness or consistency of the system responses [1, 8, 12, 13, 36].

3. Data Annotation Methodology

During the creation of the INSPIRED [1] dataset, items and other entities were annotated in an automated way, as described above. For example, genre keywords were annotated using a regular expression to match a set of predefined tokens. Regarding actors, directors, and other entities, a pattern matching technique was used, where words starting with a capital letter were searched in the TMDB database (https://www.themoviedb.org). A similar technique was used for movie titles.
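To illustrate the kind of automatic annotation described above, the following is a minimal sketch of keyword- and pattern-based matching in Python. The keyword list, the title lookup, and the placeholder indexing scheme are simplified assumptions and not the original INSPIRED annotation code; the sketch merely shows why exact matching is brittle when titles are misspelled or incomplete.

```python
import re

# Hypothetical, simplified stand-ins for the resources used during annotation
# (the original scripts and the full TMDB lookup are not part of the dataset release).
GENRE_KEYWORDS = ["comedy", "drama", "horror", "thriller", "romance", "documentary"]
KNOWN_TITLES = {"groundhog day", "borat"}  # e.g., titles retrieved from TMDB

GENRE_RE = re.compile(r"\b(" + "|".join(GENRE_KEYWORDS) + r")\b", re.IGNORECASE)
# Sequences of capitalized words are candidate names (titles, actors, directors).
NAME_RE = re.compile(r"\b([A-Z][a-z]+(?:\s+[A-Z][a-z]+)*)\b")

def annotate(utterance: str) -> str:
    """Replace matched genres and exactly matched titles with indexed placeholders."""
    index = 0

    def genre_repl(_match: re.Match) -> str:
        nonlocal index
        index += 1
        return f"[MOVIE_GENRE_{index}]"

    annotated = GENRE_RE.sub(genre_repl, utterance)
    for candidate in NAME_RE.findall(annotated):
        if candidate.lower() in KNOWN_TITLES:       # exact string lookup only
            index += 1
            annotated = annotated.replace(candidate, f"[MOVIE_TITLE_{index}]")
    return annotated

# The misspelled "Groundhogs Day" is not found by the exact lookup and therefore
# stays unannotated, which mirrors the missing annotations observed in INSPIRED.
print(annotate("My favorite comedy movies are Groundhogs Day and Borat"))
```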
However, as mentioned, we observed a large number of cases where items and entities were either wrongly annotated or missing annotations. To answer our research question on the impact of the quality of the underlying data on the quality of the responses of a CRS, we fixed the annotations as follows.

Procedure
To fix the annotations, we interviewed a number of university students to assess their knowledge of the movies domain and their ability to carry out the correction task. Subsequently, we hired two students and instructed them on how to annotate and clean the dataset. First, they were briefed on the logical format of the original annotations and how to retain that format. Second, they were asked to read each utterance individually, to detect potential noise, and to analyze which items or entities (e.g., title, genre, actor, or director) are mentioned in it. In case of ambiguity or obscurity, they were allowed to access online portals, e.g., IMDb (https://www.imdb.com/). Note that regarding the genres, a set of 27 keywords was provided to them, which we curated and used in our earlier research [36]. After the briefing, the dataset was split evenly between the two annotators. On a weekly basis, their performance and the accuracy of the annotations were checked by one of the authors. Finally, after annotating the complete dataset, a number of additional validation steps were applied.

First, using a Python script, we ensured that every placeholder is enclosed by '[' and ']' as was done originally, e.g., [MOVIE_TITLE_1]. Second, another thorough manual examination of the entire improved dataset was performed to fix any missing annotations or noise. In that context, we also double-checked the consistency of the format and of the annotations.
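The format check mentioned above can be approximated with a few lines of Python. This is an illustrative sketch rather than the exact script we used: the placeholder vocabulary, the TSV column layout, and the file name are assumptions.

```python
import csv
import re

# Well-formed placeholders look like [MOVIE_TITLE_1] or [MOVIE_GENRE_2];
# the exact tag vocabulary should be taken from the released files.
WELL_FORMED = re.compile(r"\[MOVIE_[A-Z]+_\d+\]")
# Looser pattern that also catches malformed variants such as "[ MOVIE_TITLE_1]"
# (stray space) or "MOVIE_TITLE_1" (missing brackets).
LOOSE = re.compile(r"\[?\s*MOVIE_[A-Z]+_?\d*\s*\]?")

def check_placeholders(path: str) -> None:
    """Print rows whose placeholders deviate from the expected [TAG_index] format."""
    with open(path, newline="", encoding="utf-8") as f:
        for row_no, row in enumerate(csv.reader(f, delimiter="\t"), start=1):
            utterance = row[-1]                       # assumed column layout
            if len(LOOSE.findall(utterance)) != len(WELL_FORMED.findall(utterance)):
                print(f"row {row_no}: check placeholders in: {utterance!r}")

check_placeholders("inspired2_dialogs.tsv")           # hypothetical file name
```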
Overall, the main observed issues are the following.

1. Missing annotations for movie titles, genres, actors, movie plots, etc.
2. Partially annotated items and entities, such as movie titles or genres, in an utterance.
3. Factually wrong annotations for movie titles.
4. Inconsistent indexing for the annotated items and entities.
5. Mistaken annotations for plain text, e.g., family, box office; human annotations may be required here.
6. Parts of the utterance or a few keywords were omitted during the annotation process.

Observed Issues
During the annotation process, we recorded the observed issues in the original annotations. Since the original annotations were created using automatic techniques, many issues were related to the limitations of the simple keyword or pattern matching techniques. Overall, we observed a number of cases where minor spelling mistakes or incomplete movie titles made the exact string matching approaches ineffective. For example, in one of the utterances, "I think I am waiting for Star Wars The Rise of Skywalker", the annotation was missing because the correct title is "Star Wars: Episode IX – The Rise of Skywalker". Similarly, we observed a significant number of cases where an utterance was only partially annotated, e.g., "ok is it scary like incidious or [MOVIE_GENRE_2] [MOVIE_TITLE_5]". In addition, at places where two entities were separated with '/' instead of a space, the automatic technique often failed to create proper annotations, e.g., "Since you like [MOVIE_GENRE_1] drama/mystery, I'm going to send you the trailer to the movie [MOVIE_TITLE_3]".

Also, the automatic approach used for INSPIRED sometimes had difficulties dealing with ambiguity. We found a number of cases where a regular word was annotated although it did not belong to any item or entity. For example, in one of the cases, "Are you interested in a current movie in the box office?", the utterance was annotated as "Are you interested in a current movie in the box [MOVIE_TITLE_0]", where the word 'office' was mistakenly annotated as an item, i.e., The Office (2005).

The INSPIRED2 Dataset
In total, 1,851 new annotations were added to INSPIRED, leading to the INSPIRED2 dataset. Most mistakes or inconsistencies were found for the items, i.e., movie titles, which are the most pertinent information for developing a CRS. We present the statistics about the new annotations in Table 2. Overall, we added around 20% new annotations in INSPIRED2. The number of issues that were fixed, e.g., duplicate annotations in an utterance, noise, or factually wrong information in the original annotations, is not shown in the presented statistics. We release INSPIRED2 online in both TSV and JSON format.

Table 2
Statistics about new annotations added in INSPIRED2

                                      Total    % Increase
Number of movie titles                  966       22.0
Number of movie genres                  206        5.0
Number of actors, directors, etc.       519       49.0
Number of movie plots                   160       54.6
Number of new annotations             1,851       18.9

4. Evaluation Methodology

We performed both offline experiments as well as a human evaluation to assess the impact of data quality on the quality of the responses of a CRS.

Offline Evaluation of Recommendation Quality
We included the following recent end-to-end learning approaches in our experiments: DeepCRS [3], KGSF [12], TG-ReDial [22], and the INSPIRED model without strategy labels [1] (the INSPIRED model with strategy labels was not publicly available). This selection of models covers various design approaches for CRS, e.g., using an additional knowledge graph or not. We used the open-source toolkit CRSLab (https://github.com/RUCAIBox/CRSLab) for our evaluations. This framework was used in earlier research as well, for example in [10, 39, 40].

For our analyses, we first trained the aforementioned CRS models using the original split ratio, i.e., 8:1:1, for each dataset. Afterwards, given the trained models and the test data for each dataset, we ran three trials for each CRS and subsequently averaged the results for the offline evaluation metrics. Note that the same procedure was applied to both versions of the dataset, i.e., INSPIRED and INSPIRED2.
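For reference, the ranking metrics reported in Section 5 (Hit@k, MRR@k, NDCG@k) follow their standard definitions for a single ground-truth item, and the scores of the three trials are averaged. The sketch below is a generic illustration of these computations, not the CRSLab implementation used in our experiments.

```python
import math
from statistics import mean

def hit_at_k(rank, k):
    """1.0 if the ground-truth item appears within the top-k positions, else 0.0."""
    return 1.0 if rank is not None and rank <= k else 0.0

def mrr_at_k(rank, k):
    """Reciprocal rank of the ground-truth item, cut off at position k."""
    return 1.0 / rank if rank is not None and rank <= k else 0.0

def ndcg_at_k(rank, k):
    """NDCG with a single relevant item: 1 / log2(rank + 1) within the top k."""
    return 1.0 / math.log2(rank + 1) if rank is not None and rank <= k else 0.0

def evaluate(trial_ranks, k=10):
    """Average each metric over the test examples of a trial, then over the trials."""
    per_trial = [
        {
            "Hit": mean(hit_at_k(r, k) for r in ranks),
            "MRR": mean(mrr_at_k(r, k) for r in ranks),
            "NDCG": mean(ndcg_at_k(r, k) for r in ranks),
        }
        for ranks in trial_ranks
    ]
    return {name: mean(t[name] for t in per_trial) for name in ("Hit", "MRR", "NDCG")}

# Toy example: three trials, each giving the rank of the ground-truth item
# for a handful of test dialogs (None = item not in the recommendation list).
print(evaluate([[1, 4, None], [2, None, 7], [1, 12, None]], k=10))
```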
User Study on Linguistic Quality
We conducted a user study to compare the perceived quality of system responses when using either INSPIRED or INSPIRED2. Specifically, we randomly sampled the same 50 dialog situations from each dataset. To create the dialog continuations, we used the retrieval-based CRS approaches RB-CRS and CRB-CRS, which we proposed in our earlier work, see [36].

In order to obtain fine-grained assessments, three human judges were involved; these judges were PhD students and were different from the ones who fixed the annotations. The specific task of the judges was to assess (rate) the meaningfulness of a system response as a proxy of its quality and consistency in a dialog situation, see [3, 41, 12]. Note that in this study we did not explicitly assess the quality of the specific item recommendations. Instead, the focus of this study was to understand the impact of the improved underlying dataset on the linguistic quality and the consistency of the generated responses.

We used a 3-point scale for these ratings, from 'Completely meaningless (1)' to 'Somewhat meaningless and meaningful (2)' to 'Completely meaningful (3)'. The human judges were provided with specific instructions on how to evaluate the meaningfulness of a response, e.g., they should assess if a response represents a logical dialog continuation and evaluate the overall language quality of the given response. Overall, the human judges were provided with 50 dialogs (446 responses to rate) that were produced using the INSPIRED and INSPIRED2 datasets. We also explained the meanings and purpose of the various placeholders contained in the responses to the human judges. Moreover, to avoid any bias in the evaluation process, the judges were not made aware which response was created for which dataset or by which CRS. Also, the order of the dialogs and the system responses was randomized.
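To make the later analysis concrete: the three judges' ratings are averaged per response, and the two dataset conditions are then compared with a Student's t-test (see Section 5). The following sketch shows such an analysis with SciPy; the ratings are made-up toy values, and the sketch assumes an independent two-sample test.

```python
import numpy as np
from scipy import stats

# ratings[condition] has shape (n_responses, n_judges) on the 3-point scale.
ratings = {
    "INSPIRED":  np.array([[2, 2, 3], [1, 2, 2], [3, 2, 2]]),   # toy values only
    "INSPIRED2": np.array([[3, 2, 3], [2, 3, 2], [3, 3, 2]]),
}

# Average over the three judges to get one score per response.
per_response = {name: r.mean(axis=1) for name, r in ratings.items()}

for name, scores in per_response.items():
    print(f"{name}: mean={scores.mean():.2f}, std={scores.std(ddof=1):.2f}")

# Two-sample Student's t-test between the two dataset conditions.
t_stat, p_value = stats.ttest_ind(per_response["INSPIRED"], per_response["INSPIRED2"])
print(f"t={t_stat:.3f}, p={p_value:.4f}")
```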
5. Results

Recommendation Quality
Table 3 shows the accuracy results for the evaluated CRS models. Specifically, we provide the results for the different benchmark CRS models in terms of the performance difference when using the original and the improved annotations. Overall, we can observe an almost consistent gain in performance for all models and on all metrics except Hit@50 when the improved dataset is used. The obtained improvements can be quite substantial, indicating that improved data quality can be helpful for CRS of different types, including (i) CRS which do not rely on additional knowledge sources, (ii) CRS that leverage additional knowledge sources, (iii) CRS that are guided by a topic policy, and (iv) CRS that rely on pre-trained language models like BERT.

Table 3
Accuracy results obtained in the offline evaluation. V1 represents INSPIRED, V2 denotes INSPIRED2, and "% Change" represents the actual performance gain/loss when using INSPIRED2 compared to INSPIRED.

                          Hit@1    Hit@10   Hit@50   MRR@1    MRR@10   MRR@50   NDCG@1   NDCG@10  NDCG@50
DeepCRS [3]    V1         0.0006   0.0464   0.1726   0.0065   0.0148   0.0193   0.0065   0.0220   0.0478
               V2         0.0256   0.0578   0.1222   0.0256   0.0306   0.0333   0.0256   0.0366   0.0504
               % Change  4161.11    24.50   -29.20   294.95   106.99    72.48   294.95    66.08     5.31
KGSF [12]      V1         0.0022   0.0216   0.0744   0.0032   0.0061   0.0084   0.0022   0.0097   0.0211
               V2         0.0066   0.0303   0.0587   0.0057   0.0123   0.0134   0.0066   0.0165   0.0223
               % Change   207.27    40.46   -21.11    75.58   100.97    58.40   207.27    70.36     5.78
TG-ReDial [22] V1         0.0365   0.1149   0.2344   0.0365   0.0572   0.0626   0.0365   0.0707   0.0967
               V2         0.0511   0.1315   0.2417   0.0511   0.0742   0.0792   0.0511   0.0877   0.1118
               % Change    40.00    14.46     3.12    40.00    29.64    26.48    40.00    24.05    15.51
INSPIRED [1]   V1         0.0151   0.0550   0.1532   0.0151   0.0241   0.0286   0.0151   0.0312   0.0527
               V2         0.0194   0.0734   0.1855   0.0194   0.0293   0.0353   0.0194   0.0392   0.0650
               % Change    28.57    33.33    21.13    28.57    21.59    23.44    28.57    25.44    23.28

Interestingly, we see negative effects for two measurements in which Hit@50 is used as a metric. A deeper investigation of this phenomenon is needed, in particular as the other metrics at this (admittedly rather uncommon) list length, MRR@50 and NDCG@50, indicate that the improved dataset is helpful to increase recommendation accuracy. At the moment, we can only speculate that the improved annotations in the ongoing dialog histories led to more diverse or niche recommendations compared to the original dataset. We might assume that the missing annotations in many cases referred to less popular movies, so that models trained without the improved annotations more often recommend popular movies, which is commonly advantageous in terms of hit rate and recall.

Linguistic Quality
We recall that three human evaluators assessed the linguistic quality of the system responses (dialog continuations), which were created based on either the INSPIRED or the INSPIRED2 dataset. As the underlying CRS systems, we considered the retrieval-based approaches RB-CRS and CRB-CRS, as mentioned above. For our analysis, we averaged the scores of the three evaluators. Table 4 shows the mean ratings across all dialog situations as well as the standard deviations. We find that also in the case of retrieval-based approaches, improving the quality of the underlying dataset was helpful, leading to higher mean scores without larger standard deviations. A Student's t-test reveals that the observed differences in the means are statistically significant (p < 0.001). We provide the data and the compiled results of our study online.

Table 4
Results of Human Evaluation

                           INSPIRED   INSPIRED2
RB-CRS    Average score      2.30       2.46
          Std. deviation     0.62       0.59
CRB-CRS   Average score      2.31       2.46
          Std. deviation     0.55       0.55

Comparison of Knowledge Concepts in Responses
To understand the impact of the new annotations on the responses in terms of the richness of knowledge concepts, we computed the number of items and entities that appeared in the system responses. Specifically, we computed the number of placeholders in the responses before they are replaced by the recommendation component, see [36]. In Table 5, we present the statistics for RB-CRS and CRB-CRS for both dataset versions. Overall, we find that the responses for the improved dataset contain between 20% and 27% more concepts and entities. We note that an increase in concepts is expected, as INSPIRED2 has almost 20% more annotations. The important observation here, however, is that the retrieval-based CRS approaches actually surfaced these richer system responses frequently.

Table 5
Number of Items and Entities included in Responses

           INSPIRED   INSPIRED2   % Increase
RB-CRS       174         222         27.6
CRB-CRS      208         251         20.7

BLEU Score Analysis
Finally, in order to understand to what extent (offline) linguistic scores correlate with the perceived quality of responses, as was done in [1, 8], we performed an analysis of the BLEU scores obtained for the different datasets. Specifically, given a system response and the corresponding ground truth response, we preprocess both sentences and compute the BLEU scores for N = {1, 2, 3, 4} grams. We provide the results of this analysis online. In sum, the analysis shows that the BLEU scores generally improve when the underlying data quality is higher, i.e., in the case of the INSPIRED2 dataset. These findings are thus well aligned with the outcomes of our human evaluation study, where using INSPIRED2 as the underlying dataset turned out to be favorable.
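For completeness, cumulative BLEU scores up to 4-grams can be computed as in the following sketch using NLTK. The tokenization and smoothing shown here are illustrative assumptions; the exact preprocessing used for the reported analysis may differ.

```python
from nltk.translate.bleu_score import SmoothingFunction, sentence_bleu

def bleu_scores(system_response: str, ground_truth: str) -> dict:
    """Cumulative BLEU-1..BLEU-4 between a system response and its ground truth."""
    hypothesis = system_response.lower().split()          # simplistic tokenization
    references = [ground_truth.lower().split()]           # list of reference token lists
    smoothing = SmoothingFunction().method1               # avoids zero scores on short texts
    return {
        n: sentence_bleu(references, hypothesis,
                         weights=tuple(1.0 / n for _ in range(n)),
                         smoothing_function=smoothing)
        for n in (1, 2, 3, 4)
    }

print(bleu_scores("I would recommend a classic horror movie.",
                  "I would suggest a classic horror movie for tonight."))
```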
6. Conclusion

Datasets containing recorded dialogs between humans are the basis for many modern CRS. In this work, we have analyzed the recent INSPIRED dataset, which was developed to build the next generation of sociable CRS. We found that automatic entity and concept labeling has its limitations, and we have improved the quality of the dataset through a manual process. We then conducted both computational experiments as well as experiments with users to analyze to what extent improved data quality impacts recommendation accuracy and the quality perception of the system's responses by users. The analyses clearly indicate the benefits of improved data quality across different technical approaches for building CRS. We release the improved dataset publicly and hope to thereby stimulate more research in sociable conversational recommender systems in the future.

References

[1] S. A. Hayati, D. Kang, Q. Zhu, W. Shi, Z. Yu, INSPIRED: Toward sociable recommendation dialog systems, in: EMNLP '20, 2020.
[2] F. Pecune, L. Callebert, S. Marsella, A socially-aware conversational recommender system for personalized recipe recommendations, in: Proceedings of the 8th International Conference on Human-Agent Interaction, HAI '20, 2020, pp. 78–86.
[3] R. Li, S. E. Kahou, H. Schulz, V. Michalski, L. Charlin, C. Pal, Towards deep conversational recommendations, in: NIPS '18, 2018, pp. 9725–9735.
[4] D. Jannach, L. Chen, Conversational Recommendation: A Grand AI Challenge, AI Magazine 43 (2022).
[5] M. Di Bratto, M. Di Maro, A. Origlia, F. Cutugno, Dialogue analysis with graph databases: Characterising domain items usage for movie recommendations (2021).
[6] C.-M. Wong, F. Feng, W. Zhang, C.-M. Vong, H. Chen, Y. Zhang, P. He, H. Chen, K. Zhao, H. Chen, Improving conversational recommender system by pretraining billion-scale knowledge graph, in: ICDE '21, 2021, pp. 2607–2612.
[7] Y. Cao, X. Wang, X. He, Z. Hu, T.-S. Chua, Unifying knowledge graph learning and recommendation: Towards a better understanding of user preferences, in: WWW '19, 2019, pp. 151–161.
[8] Q. Chen, J. Lin, Y. Zhang, M. Ding, Y. Cen, H. Yang, J. Tang, Towards knowledge-based recommender dialog system, in: EMNLP-IJCNLP '19, 2019, pp. 1803–1813.
[9] J. Zhou, B. Wang, R. He, Y. Hou, CRFR: Improving conversational recommender systems via flexible fragments reasoning on knowledge graphs, in: EMNLP '21, 2021, pp. 4324–4334.
[10] K. Chen, S. Sun, Knowledge-based conversational recommender systems enhanced by dialogue policy learning, in: IJCKG '21, 2021, pp. 10–18.
[11] A. Wang, C. D. V. Hoang, M.-Y. Kan, Perspectives on crowdsourcing annotations for natural language processing, Language Resources and Evaluation 47 (2013) 9–31.
[12] K. Zhou, W. X. Zhao, S. Bian, Y. Zhou, J.-R. Wen, J. Yu, Improving conversational recommender systems via knowledge graph based semantic fusion, in: KDD '20, 2020, pp. 1006–1014.
[13] Y. He, L. Liao, Z. Zhang, T.-S. Chua, Towards enriching responses with crowd-sourced knowledge for task-oriented dialogue, in: MuCAI '21, 2021, pp. 3–11.
[14] T. Arjannikov, C. Sanden, J. Z. Zhang, Verifying tag annotations through association analysis, in: ISMIR '13, 2013, pp. 195–200.
[15] J. S. Grosman, P. H. Furtado, A. M. Rodrigues, G. G. Schardong, S. D. Barbosa, H. C. Lopes, ERAS: Improving the quality control in the annotation process for natural language processing tasks, Information Systems 93 (2020) 101553.
[16] B. C. Benato, J. F. Gomes, A. C. Telea, A. X. Falcão, Semi-automatic data annotation guided by feature space projection, Pattern Recognition 109 (2021) 107612.
[17] A. Manzoor, D. Jannach, Generation-based vs. retrieval-based conversational recommendation: A user-centric comparison, in: RecSys '21, 2021.
[18] D. Jannach, A. Manzoor, W. Cai, L. Chen, A survey on conversational recommender systems, ACM Computing Surveys 54 (2021) 1–36.
[19] C. Gao, W. Lei, X. He, M. de Rijke, T.-S. Chua, Advances and challenges in conversational recommender systems: A survey, AI Open 2 (2021) 100–126.
[20] W. Cai, L. Chen, Predicting user intents and satisfaction with dialogue-based conversational recommendations, in: UMAP '20, 2020, pp. 33–42.
[21] D. Kang, A. Balakrishnan, P. Shah, P. Crook, Y.-L. Boureau, J. Weston, Recommendation as a communication game: Self-supervised bot-play for goal-oriented dialogue, in: EMNLP-IJCNLP '19, 2019, pp. 1951–1961.
[22] K. Zhou, Y. Zhou, W. X. Zhao, X. Wang, J.-R. Wen, Towards topic-guided conversational recommender system, in: ICCL '20, 2020, pp. 4128–4139.
[23] Z. Fu, Y. Xian, Y. Zhu, Y. Zhang, G. de Melo, COOKIE: A dataset for conversational recommendation over knowledge graphs in e-commerce, 2020. arXiv:2008.09237.
[24] K. Christakopoulou, F. Radlinski, K. Hofmann, Towards conversational recommender systems, in: KDD '16, 2016, pp. 815–824.
[25] X. Ren, H. Yin, T. Chen, H. Wang, Z. Huang, K. Zheng, Learning to ask appropriate questions in conversational recommendation, in: SIGIR '21, 2021, pp. 808–817.
[26] P. Stenetorp, S. Pyysalo, G. Topić, T. Ohta, S. Ananiadou, J. Tsujii, brat: a web-based tool for NLP-assisted text annotation, in: ACL '12, 2012, pp. 102–107.
[27] C. Zong, R. Xia, J. Zhang, Data Annotation and Preprocessing, 2021, pp. 15–31.
[28] P. Röttger, B. Vidgen, D. Hovy, J. B. Pierrehumbert, Two contrasting data annotation paradigms for subjective NLP tasks, 2021. arXiv:2112.07475.
[29] D. Jannach, ADVISOR SUITE – A knowledge-based sales advisory system, in: ECAI '04, 2004, pp. 720–724.
[30] K. McCarthy, Y. Salem, B. Smyth, Experience-based critiquing: Reusing critiquing experiences to improve conversational recommendation, in: ICCBR '10, 2010, pp. 480–494.
[31] L. Chen, P. Pu, Critiquing-based recommenders: survey and emerging trends, User Modeling and User-Adapted Interaction 22 (2012) 125–150.
[32] Q. Chen, J. Lin, Y. Zhang, H. Yang, J. Zhou, J. Tang, Towards knowledge-based personalized product description generation in e-commerce, in: KDD '19, 2019, pp. 3040–3050.
[33] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in: NAACL-HLT, 2019.
[34] L. Wang, H. Hu, L. Sha, C. Xu, K.-F. Wong, D. Jiang, Finetuning large-scale pre-trained language models for conversational recommendation with knowledge graph, 2021. arXiv:2110.07477.
[35] A. Manzoor, D. Jannach, Conversational recommendation based on end-to-end learning: How far are we?, Computers in Human Behavior Reports (2021) 100139.
[36] A. Manzoor, D. Jannach, Towards retrieval-based conversational recommendation, Information Systems (2022) 102083.
[37] D. Jannach, Evaluating conversational recommender systems, Artificial Intelligence Review, forthcoming (2022).
[38] T. Zhang, Y. Liu, P. Zhong, C. Zhang, H. Wang, C. Miao, KECRS: Towards knowledge-enriched conversational recommendation system, 2021. arXiv:2105.08261.
[39] Y. Zhou, K. Zhou, W. X. Zhao, C. Wang, P. Jiang, H. Hu, C²-CRS: Coarse-to-fine contrastive learning for conversational recommender system, in: WSDM '22, 2022, pp. 1488–1496.
[40] Y. Li, B. Peng, Y. Shen, Y. Mao, L. Liden, Z. Yu, J. Gao, Knowledge-grounded dialogue generation with a unified knowledge representation, 2021. arXiv:2112.07924.
[41] D. Jannach, A. Manzoor, End-to-end learning for conversational recommendation: A long way to go?, in: IntRS Workshop at RecSys '20, Online, 2020.