IDRE: AI Generated Dataset for Enhancing Empathetic Chatbot Interactions in Italian language

Simone Manai1,2,*,†, Laura Gemme2,†, Roberto Zanoli3 and Alberto Lavelli3

1 University of Trento, 38123 Trento, Italy
2 Lutech-Softjam, 16148 Genova, Italy
3 Fondazione Bruno Kessler, 38123 Trento, Italy

Abstract
This paper introduces IDRE (Italian Dataset for Rephrasing with Empathy), a novel automatically generated Italian linguistic dataset. IDRE comprises typical chatbot user utterances in the healthcare domain, corresponding chatbot responses, and empathetically enhanced chatbot responses. The dataset was generated using the Llama2 language model and evaluated by human raters based on predefined metrics. The IDRE dataset offers a comprehensive and realistic collection of Italian chatbot-user interactions suitable for training and refining chatbot models in the healthcare domain. This facilitates the development of chatbots capable of natural and productive conversations with healthcare users. Notably, the dataset incorporates empathetically enhanced chatbot responses, enabling researchers to investigate the effects of empathetic language on fostering more positive and engaging human-machine interactions within healthcare settings. The methodology employed for the construction of the IDRE dataset can be extended to generate sentences in additional languages and domains, thereby expanding its applicability and utility. The IDRE dataset is publicly available for research purposes.

Keywords
Empathy, LLMs, Llama2, Dataset, Chatbot, Healthcare

CLiC-it 2024: Tenth Italian Conference on Computational Linguistics, Dec 04-06, 2024, Pisa, Italy
* Corresponding author.
† These authors contributed equally.
simone.manai@unitn.it (S. Manai); l.gemme@softjam.it (L. Gemme); zanoli@fbk.eu (R. Zanoli); lavelli@fbk.eu (A. Lavelli)
ORCID: 0000-0002-7175-6804 (A. Lavelli); 0000-0003-0870-0872 (R. Zanoli)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073

1. Introduction

Emotional intelligence has been widely recognized as a crucial factor influencing human communication, impacting aspects such as behavioral choices and the interpretation of information [1]. Consequently, there has been a growing interest in developing chatbots capable of exhibiting empathetic responses [2] [3] [4]. While significant strides have been made in this direction, the integration of empathy into commercial chatbots remains challenging due to the rigid constraints imposed by business rules: for example, the response must not lose its original meaning and the dialogue must maintain its structure.

To address this limitation, one possible approach is to build a layer that rephrases the bot's response by increasing empathy without altering the structure or meaning of the underlying dialogue. This strategy offers the potential to enhance user experience and create a foundation for more sophisticated empathetic dialogue systems.

To facilitate the development of such systems, a robust dataset containing empathetic responses is essential. Despite the increasing body of research on emotion recognition and generation in human-computer interaction, there is a notable absence of publicly available datasets specifically focused on empathy in chatbot interactions.

This paper introduces the IDRE dataset, a new Italian language resource comprising human-bot interactions within the healthcare domain. The dataset is publicly available, and its address is provided in the Online Resource section. The dataset includes the user questions, original bot responses and corresponding empathetic reformulations for a total of 480 sentences, providing a valuable foundation for research and development in empathetic chatbot technology; see Table 1 for an example. The paper also elaborates on the methodology employed for dataset generation, highlighting its applicability to diverse domains and languages.

Table 1: Examples of Question, Answer and Answer with empathy

2. Related Works

The development of empathetic chatbots capable of understanding and responding to human emotions represents a research area of growing interest [5]. However, building such systems requires high-quality datasets that include examples of human-machine interactions with empathic components.

Despite the growing availability of datasets for machine learning and natural language processing, the lack of resources dedicated specifically to empathetic Italian-language chatbots represents a significant challenge.

There are datasets that contain emotional information, such as [6] [7] [8] [9]. However, these resources focus primarily on labelling words or sentences with generic emotions and do not provide the context for complex, nuanced conversational interactions like those required for developing empathetic chatbots.

3. Dataset

This section details the methodology employed for the construction of the IDRE dataset and outlines the evaluation process implemented.

The dataset created consists of 480 sentences and roughly 18k total tokens divided as follows: 2k for the question, 7k for the bot's response and 9k for the response with empathy.

3.1. Dataset Creation

The IDRE dataset comprises triplets of sentences: the first sentence represents a user query, the second sentence is the corresponding response generated by a chatbot, and the third sentence is a transformed version of the second sentence intended to enhance its empathetic tone.

The sentence generation process was carried out with the Llama2 13B language model [11], operating on an Azure Virtual Machine equipped with four NVIDIA Tesla V100 GPUs. The choice of Llama 2 was motivated by its open-source nature, which allowed flexible and provider-independent access.

The dataset generation process consists of two phases, as illustrated in Figure 1:

QnA Sentence Generation: To ensure the generation of empathetic and compassionate responses, the healthcare domain was selected as the focus for the initial set of bot-human sentence pairs. This domain, characterized by sensitive topics, is well-suited for evaluating the model's ability to generate empathetic responses.

The thirteen specific topics chosen for the sentence pairs were invented for the purposes of the experiment: 'information on breast cancer', 'breast cancer prevention', 'therapies for breast cancer', 'psychological support after a cancer diagnosis', 'life expectancy after a cancer diagnosis', 'psychological support after surgery', 'hospital admissions', 'post-operative care', 'information on leukemia', 'psychological support', 'anti-cancer therapies', 'information on stroke', and 'preparation for surgeries'.

An initial set of bot-human sentence pairs was generated using the Llama2 model. These pairs simulated a typical chatbot interaction concerning a specific health issue or domain. For instance, a human query such as "What are the symptoms of COVID-19?" would elicit a corresponding chatbot response like "The most common symptoms of COVID-19 are fever, dry cough, and tiredness".

Empathy Enhancement: After the generation of the initial sentence pairs, an empathy enhancement process was undertaken. Leveraging the Llama2 model once more, the chatbot responses were modified to convey a more empathetic tone. This was achieved by prepending expressions of concern or appreciation, and by substituting specific words to engender a supportive demeanor. To illustrate, the aforementioned chatbot response could be transformed into "I understand that you're concerned about COVID-19. Some common symptoms include fever, dry cough, and fatigue".

Both prompts are included in the Appendix.

Figure 1: Dataset generation process

3.2. Evaluation Methodology

To ensure the quality of the generated sentences, a rigorous evaluation process was implemented. Twelve volunteer annotators from Lutech-Softjam, experienced IT developers and project managers with a solid understanding of the chatbot domain, participated. Despite lacking prior experience in linguistic annotation, their familiarity with chatbots significantly accelerated the evaluation process. Before starting, they underwent comprehensive training on the evaluation task.

Each evaluator was assigned 70 sentences for assessment. To ensure diverse evaluations, 40 sentences were unique to each evaluator and used for dataset creation, while 30 common sentences were evaluated by all evaluators; these were used solely for measuring agreement and are not part of the dataset. This approach ensured that each sentence received focused evaluation while also providing a consistent assessment across evaluators.

The evaluation process involves the administration of a metric-specific question, which requires a response on a scale of 1 to 5.

The rating scale used is the following:

1. Totally disagree
2. Disagree
3. Neutral
4. Agree
5. Totally agree

The specific metrics used in this evaluation are:

• Bot sentence correctness: measures the absence of spelling, grammatical, or punctuation errors in the question and the bot's answer. The question used is: “Il testo della risposta con empatia è corretto sia dal punto di vista grammaticale che semantico.”
• Absence of English words in bot sentences: checks if there are any words or sentences in English within the sentences generated by the model. The question used is: “Nel testo della domanda dell’utente e della risposta del bot (colonne QUESTION e ANSWER) non sono presenti parole o frasi in lingua inglese, a meno che non siano di uso comune in italiano (ad esempio “badge”, “sport”, ecc.)”
• Empathic answer correctness: measures the absence of spelling, grammatical, or punctuation errors in the bot's answer with the insertion of empathy. The question used is: “Il testo della risposta con empatia è corretto sia dal punto di vista grammaticale che semantico.”
• Absence of English words in empathic sentences: checks if there are any words or sentences in English within the sentences with empathy generated by the model. The question used is: “Nel testo della risposta con empatia non sono presenti parole o frasi in lingua inglese, a meno che non siano di uso comune in italiano (ad esempio “badge”, “sport”, ecc.)”
• Semantic coherence: measures if the bot's answer and the bot's answer with empathy are semantically similar. The question used is: “La risposta con empatia ha lo stesso significato semantico della risposta del chatbot. Non ci sono concetti mancanti o contraddittori”
• Empathy increase: measures if the bot's answer with empathy has an effective increase of empathy compared to the bot's answer. The question used is: “La frase nella colonna ANSWER WITH EMPATHY esprime più empatia rispetto alla frase nella colonna ANSWER”

4. Dataset Analysis

This section analyses data quality by examining both the distribution of agreement scores and the level of inter-annotator agreement (IAA).

Due to a limited pool of available evaluators, the dataset was constrained to 480 annotated sentences. These sentences were evenly distributed among 12 volunteers, each assessing 40 sentences (excluding the 30 sentences used for measuring agreement). This approach was chosen to ensure the quality of the annotations while preventing evaluator fatigue.

Figure 2: Scores Distribution for all metrics.

A more in-depth analysis reveals that 223 sentences, equal to 46.5% of the total, have a score greater than or equal to 3 on all the metrics considered. This means that these sentences were judged to be of high quality in every aspect analysed. This subset of data can be used to fine-tune language models.

To obtain an analysis that is more robust and less subject to small variations, the annotation categories were grouped into three macro-categories: scores 1 and 2, score 3 (neutral), and scores 4 and 5.

The analysis of sentences with lower scores (1 and 2) revealed three key factors: grammatical errors, the presence of non-Italian words, and the lack of a significant increase in empathy, as shown in Figure 2.

Grammatical Errors: A substantial portion of sentences with lower scores exhibited grammatical errors. This highlights the importance of incorporating robust grammar checks during the generation process. Example: “Ohimini, cara/o utente, è comprensibile che durante il trattamento del tumore possa esserti difficile gestire i sintomi. Sono qui per aiutarti a trovare soluzioni e supporti per farcela insieme”. “Ohimini” is a made-up word and “supporti” contains a typo.

Non-Italian Words: The lower-score sentences frequently included non-Italian words, primarily English (here, “after surgery”). This deviation from the dataset's focus on Italian-language interactions can be attributed to the underlying multilingual language model, which was predominantly trained on English text. This highlights the need for improved language model training to prioritize Italian vocabulary. Example: “Per prevenire le infezioni after surgery, è importante seguire le istruzioni del medico e del personale ospedaliero, come ad esempio lavare le mani frequentemente, evitare di toccare la ferita e utilizzare dispositivi di protezione individuali.”

Lack of a significant increase in empathy: Among the lower-score sentences (173, representing 36%), the transformed responses (indicated by the blue and orange columns) did not exhibit a significant rise in empathy or indecision compared to the original chatbot responses. This suggests that further refinement of the empathy-enhancing techniques might be necessary.

To analyse the inter-annotator agreement (IAA) of the annotations produced as outlined in Section 3.2, Fleiss' kappa coefficient was employed to quantify the level of concordance between multiple annotators while accounting for potential chance agreement. Kappa values range from -1 to 1, with negative values indicating agreement below chance, values between 0 and 0.2 representing slight agreement, 0.21 to 0.4 fair agreement, 0.41 to 0.6 moderate agreement, 0.61 to 0.8 substantial agreement, and values exceeding 0.8 denoting almost perfect agreement.

Calculating the kappa coefficients on the aggregated categories made it possible to evaluate the inter-annotator agreement in a more robust way. The results are summarized in Table 2. Notably, the highest levels of agreement were observed for metrics related to the presence of English words. This finding is likely attributable to the relative simplicity of this specific annotation task. Conversely, metrics assessing other linguistic features exhibited lower, yet still acceptable, levels of agreement, generally falling within the moderate range.

Table 2: Agreement results

Metric                                            Fleiss' Kappa    Aggregate Fleiss' Kappa
Bot sentence correctness                          0.608            0.821
Absence of English words in bot sentences         0.781            0.927
Empathic answer correctness                       0.566            0.807
Absence of English words in empathic sentences    0.782            0.948
Semantic coherence                                0.587            0.881
Empathy increase                                  0.645            0.840

Figure 3 presents the distribution of annotations for three metrics: “Empathy increase”, “Bot sentence correctness”, and “Absence of English words in bot sentences”. The distribution for “Absence of English words in bot sentences” exhibits a marked concentration towards the highest score (5), indicating a strong consensus among annotators regarding the absence of English words in bot sentences. In contrast, the distribution for “Empathy increase” or “Bot sentence correctness” is more dispersed across the entire range of possible scores, suggesting a greater degree of variability in annotator assessments of bot empathy increase.

The observed disparity in distribution patterns between the metrics can be attributed to the inherent nature of the annotation tasks. The task of identifying the absence of English words in bot sentences is relatively straightforward and objective, leading to a higher degree of agreement among annotators. On the other hand, assessing bot empathy increase involves a more subjective judgment of factors such as grammatical accuracy, coherence, and relevance, resulting in a wider range of annotations.

The same behaviour can be noticed with the metric “Empathic answer correctness”.

Figure 3: Distribution of “Empathy increase”, “Bot sentence correctness”, and “Absence of English words in bot sentences”

5. Discussion and Conclusion

In this work, we have presented the creation of a dataset of sentences representing typical interactions with a healthcare chatbot. The dataset includes both user input sentences and empathetic responses generated by the chatbot. Human validation has confirmed the quality and usefulness of the dataset for developing and evaluating empathetic chatbots in the healthcare domain.

This work presents a two-pronged contribution to the field of empathetic chatbots, specifically focusing on the Italian language.

Firstly, it addresses the critical issue of data scarcity by providing a high-quality, annotated dataset for training and evaluating empathetic chatbots within a healthcare context. This dataset can be employed to fine-tune large language models (LLMs) such as Llama2, enabling them to generate responses with demonstrably enhanced empathetic qualities. The limitations of non-fine-tuned models are exemplified through the observation that they can produce factually incorrect or unempathetic sentences (e.g., "Il tuo corpo è vulnerabile al rischio del tumore al seno a causa della tua età avanzata, nonostante la tua vitalità e forza interiori. La storia familiare di tumori al seno nella tua famiglia e la tua condizione di obesità possono aumentare il rischio, come pure l'abuso di tabacco e alcool. Inoltre, la tua scelta di non avere figli o di averli dopo l'età di 35 anni può aggiungere ulteriore rischio al tuo corpo."). By leveraging the proposed dataset and selecting sentences with demonstrably high empathy scores, a targeted training set can be constructed specifically for this purpose. This, in turn, allows for the fine-tuning of the LLM, significantly improving its ability to generate empathetic responses in a healthcare setting.

Secondly, the work contributes a rigorous human validation methodology for evaluating the effectiveness of empathy expression in chatbots. This methodology provides a valuable tool for researchers and developers working in this domain.

5.1. Future Work

In the future, we intend to expand the work in two main directions:

Domain expansion: We will explore the creation of similar datasets for other domains, such as customer service or education, to assess the applicability of our approach in different contexts.

Comparison of language models: We will conduct a comparative study to evaluate the performance of different language models in generating empathetic chatbot responses. This study will allow us to identify the most suitable language model for this specific task.

We believe that this work represents an important step towards the development of empathetic chatbots capable of offering a more natural and engaging user experience, especially in sensitive contexts such as healthcare.

Acknowledgements

The authors would like to express their sincere gratitude to Lutech-Softjam colleagues for their invaluable contributions to the evaluation of the dataset used in this study. Their expertise and meticulous work in assessing the dataset's quality and relevance were instrumental in ensuring the robustness of our research findings.

References

[1] J.-M. Fellous and M. A. Arbib, "Who needs emotions?: The brain meets the robot," Oxford University Press, 2005.
[2] E. Zaranis, G. Paraskevopoulos, A. Katsamanis and A. Potamianos, "EmpBot: A T5-based Empathetic Chatbot focusing on Sentiments," arXiv preprint arXiv:2111.00310, 2021.
[3] J. Shin, P. Xu, A. Madotto and P. Fung, "Generating empathetic responses by looking ahead the user's sentiment," in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020.
[4] F. Liu, Q. Mao, L. Wang, N. Ruwa, J. Gou and Y. Zhan, "An emotion-based responding model for natural language conversation," Springer Science+Business Media, 2018.
[5] Q. Guo, Z. Zhu, Q. Lu, D. Zhang and W. Wu, "A Dynamic Emotional Session Generation Model Based on Seq2Seq and a Dictionary-Based Attention Mechanism," Appl. Sci., p. 10, 2020.
[6] R. Sprugnoli, "MultiEmotions-It: a New Dataset for Opinion Polarity and Emotion," in Proceedings of the Seventh Italian Conference on Computational Linguistics, 2020.
[7] S. M. Mohammad, "Practical and ethical considerations in the effective use of emotion and sentiment lexicons," arXiv preprint, 2020.
[8] A. Welivita, Y. Xie and P. Pu, "A Large-Scale Dataset for Empathetic Response Generation," Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 1251-1264, 2021.
[9] H. Rashkin, E. M. Smith, M. Li and Y.-L. Boureau, "Towards empathetic open-domain conversation models: A new benchmark and dataset," arXiv preprint arXiv:1811.00207, 2018.
[10] A. Welivita, Y. Xie and P. Pu, "A Large-Scale Dataset for Empathetic Response Generation," Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 1251-1264, 2021.
[11] H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale and others, "Llama 2: Open foundation and fine-tuned chat models," arXiv preprint arXiv:2307.09288, 2023.
[12] S. C. Gadanho, "Learning Behavior-Selection by Emotions and Cognition in a Multi-Goal Robot Task," Journal of Machine Learning Research, vol. 1, pp. 385-412, 2003.

A. Online Resource

The dataset can be downloaded at https://github.com/smanai/idre

B. Appendix

The prompts used for both steps of dataset creation are shown below.

Prompt for QnA Sentence Generation:

"""genera {} coppie di domande utente e risposta di un assistente virtuale.
Le domande devono essere in lingua italiana e rappresentano frasi tipiche una persona che vuole informazioni nel dominio "{}".
Le risposte sono quelle di un tipico chatbot di un call center di un'azienda ospedaliera.
Le risposte devono solo esporre dei fatti oggettivi e scientifici ma prive di empatia.
la struttura del output deve essere:
#
utente:
assistente:"""

Prompt for Empathy Enhancement:

"""La seguente frase è la risposta di un chatbot di un call center di un ospedale ad una persona che richiede informazioni.
La frase è informativa, ma non trasmette empatia per la situazione della persona che chiama.
Puoi modificare la seguente frase aggiungendo l'empatia mancante?
Puoi modificare la frase aggiungendo testo o modificandolo ma deve mantenere lo stesso significato semantico.
la frase modificata deve essere scritta in lingua italiana.
Non devi scrivere altro testo oltre alla frase trasformata.
inizia la modifica della frase con il carattere "-" come in un elenco puntato.
"""
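As an illustration of how the QnA prompt above could be used programmatically, the following minimal Python sketch fills its two `{}` placeholders and parses a reply that follows the requested "#" / "utente:" / "assistente:" structure. This is not the authors' code: the function names are hypothetical, the placeholder order (number of pairs, then domain) is inferred from the wording of the prompt, and the sketch assumes that the model's output adheres to the requested structure with one line per question and answer. The call to the language model itself is deliberately left out.

```python
import re

# QnA prompt template as given in the Appendix (two positional placeholders).
QNA_PROMPT = """genera {} coppie di domande utente e risposta di un assistente virtuale.
Le domande devono essere in lingua italiana e rappresentano frasi tipiche una persona che vuole informazioni nel dominio "{}".
Le risposte sono quelle di un tipico chatbot di un call center di un'azienda ospedaliera.
Le risposte devono solo esporre dei fatti oggettivi e scientifici ma prive di empatia.
la struttura del output deve essere:
#
utente:
assistente:"""


def build_qna_prompt(n_pairs: int, domain: str) -> str:
    """Fill the placeholders; the order (count, domain) is an assumption."""
    return QNA_PROMPT.format(n_pairs, domain)


def parse_qna_output(text: str) -> list[tuple[str, str]]:
    """Extract (question, answer) pairs from '#'-separated model output.

    Blocks missing either field are skipped rather than repaired.
    """
    pairs = []
    for block in text.split("#"):
        question = re.search(r"utente:\s*(.+)", block)
        answer = re.search(r"assistente:\s*(.+)", block)
        if question and answer:
            pairs.append((question.group(1).strip(), answer.group(1).strip()))
    return pairs


if __name__ == "__main__":
    print(build_qna_prompt(5, "informazioni sull'ictus"))
    # Invented reply, used only to exercise the parser.
    reply = "#\nutente: Domanda di esempio?\nassistente: Risposta di esempio.\n"
    print(parse_qna_output(reply))
```

Splitting on "#" is the simplest choice given the requested output structure; a reply containing "#" inside an answer would need a stricter parser.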
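The inter-annotator agreement analysis of Section 4 relies on Fleiss' kappa, computed both on the raw 1-5 scores and on the three macro-categories (scores 1-2, score 3, scores 4-5). The sketch below shows one way such a computation could look in plain Python. It is an illustrative implementation of the standard formula, not the authors' script; the function names and the toy ratings are invented, and returning 1.0 when only a single category is ever used is a convention chosen here for the degenerate case.

```python
from collections import Counter


def aggregate(score: int) -> str:
    """Map a 1-5 rating to the three macro-categories of Section 4."""
    if score <= 2:
        return "low"
    if score == 3:
        return "neutral"
    return "high"


def fleiss_kappa(ratings: list[list]) -> float:
    """Fleiss' kappa; ratings[i] holds the labels all raters gave to item i."""
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if any(len(item) != n_raters for item in ratings):
        raise ValueError("every item needs the same number of ratings")
    counts = [Counter(item) for item in ratings]
    # Observed agreement: mean share of agreeing rater pairs per item.
    p_bar = sum(
        (sum(c * c for c in cnt.values()) - n_raters) / (n_raters * (n_raters - 1))
        for cnt in counts
    ) / n_items
    # Chance agreement from the overall proportion of each category.
    totals = Counter()
    for cnt in counts:
        totals.update(cnt)
    p_e = sum((t / (n_items * n_raters)) ** 2 for t in totals.values())
    if p_e == 1.0:
        return 1.0  # Degenerate case (one category only); convention assumed here.
    return (p_bar - p_e) / (1.0 - p_e)


if __name__ == "__main__":
    # Invented toy ratings: 3 items, 3 raters each.
    raw = [[4, 5, 4], [1, 2, 1], [3, 3, 3]]
    grouped = [[aggregate(s) for s in item] for item in raw]
    print(fleiss_kappa(raw), fleiss_kappa(grouped))
```

In the toy data the raters disagree only within a macro-category (4 versus 5, 1 versus 2), so grouping raises kappa, which mirrors the direction of the difference between the two columns of Table 2.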
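Section 4 also identifies a subset of sentences scoring at least 3 on every metric (223 of 480, i.e. 46.5%) as suitable for fine-tuning. A minimal sketch of such a filter is given below, assuming each annotated triplet is available as a dictionary with one 1-5 score per metric. The key names are hypothetical and do not necessarily match the column names of the published dataset.

```python
# Hypothetical keys: one 1-5 score for each metric listed in Section 3.2.
METRICS = [
    "bot_sentence_correctness",
    "no_english_in_bot_sentences",
    "empathic_answer_correctness",
    "no_english_in_empathic_sentences",
    "semantic_coherence",
    "empathy_increase",
]


def select_high_quality(rows: list[dict], threshold: int = 3) -> list[dict]:
    """Keep rows whose score is >= threshold on every metric."""
    return [row for row in rows if all(row[m] >= threshold for m in METRICS)]


def share(selected: int, total: int) -> float:
    """Percentage of selected rows, rounded to one decimal place."""
    return round(100 * selected / total, 1)


if __name__ == "__main__":
    # Two invented rows: the second fails on a single metric and is dropped.
    rows = [
        dict.fromkeys(METRICS, 4),
        {**dict.fromkeys(METRICS, 5), "empathy_increase": 2},
    ]
    print(len(select_high_quality(rows)))
    print(share(223, 480))  # figures reported in Section 4
```

Raising `threshold` to 4 would give a stricter subset that excludes neutral judgments, at the cost of fewer training examples.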