IDRE: AI Generated Dataset for Enhancing Empathetic Chatbot Interactions in Italian language.

Simone Manai1,2,*,†, Laura Gemme2,†, Roberto Zanoli3 and Alberto Lavelli3

1 University of Trento, 38123 Trento, Italy
2 Lutech-Softjam, 16148 Genova, Italy
3 Fondazione Bruno Kessler, 38123 Trento, Italy


                                                    Abstract
                                                    This paper introduces IDRE (Italian Dataset for Rephrasing with Empathy), a novel automatically
                                                    generated Italian linguistic dataset. IDRE comprises typical chatbot user utterances in the healthcare
                                                    domain, corresponding chatbot responses, and empathetically enhanced chatbot responses. The
                                                    dataset was generated using the Llama2 language model and evaluated by human raters based on
                                                    predefined metrics. The IDRE dataset offers a comprehensive and realistic collection of Italian
                                                    chatbot-user interactions suitable for training and refining chatbot models in the healthcare domain.
                                                    This facilitates the development of chatbots capable of natural and productive conversations with
                                                    healthcare users. Notably, the dataset incorporates empathetically enhanced chatbot responses,
                                                    enabling researchers to investigate the effects of empathetic language on fostering more positive and
                                                    engaging human-machine interactions within healthcare settings. The methodology employed for
                                                    the construction of the IDRE dataset can be extended to generate sentences in additional languages
                                                    and domains, thereby expanding its applicability and utility. The IDRE dataset is publicly available
                                                    for research purposes.

                                                    Keywords
                                                    Empathy, LLMs, Llama2, Dataset, Chatbot, Healthcare


1. Introduction

    Emotional intelligence has been widely recognized as a crucial factor influencing human communication, impacting aspects such as behavioral choices and the interpretation of information [1]. Consequently, there has been a growing interest in developing chatbots capable of exhibiting empathetic responses [2] [3] [4]. While significant strides have been made in this direction, the integration of empathy into commercial chatbots remains challenging due to the rigid constraints imposed by business rules, for example that the response must not lose its original meaning and that the dialogue must maintain its structure.
    To address this limitation, one possible approach is to build a layer that rephrases the bot's response by increasing empathy without altering the structure or meaning of the underlying dialogue. This strategy offers the potential to enhance user experience and create a foundation for more sophisticated empathetic dialogue systems.
    To facilitate the development of such systems, a robust dataset containing empathetic responses is essential. Despite the increasing body of research on emotion recognition and generation in human-computer interaction, there is a notable absence of publicly available datasets specifically focused on empathy in chatbot interactions.
    This paper introduces the IDRE dataset, a new Italian language resource comprising human-bot interactions within the healthcare domain. The dataset is publicly available, and its address is provided in the Online Resource section. The dataset includes the user questions, original bot responses and corresponding empathetic reformulations for a total of 480 sentences, providing a valuable foundation for research and


CLiC-it 2024: Tenth Italian Conference on Computational Linguistics, Dec 04-06, 2024, Pisa, Italy
∗ Corresponding author.
† These authors contributed equally.
simone.manai@unitn.it (S. Manai); l.gemme@softjam.it (L. Gemme); zanoli@fbk.eu (R. Zanoli); lavelli@fbk.eu (A. Lavelli)
0000-0002-7175-6804 (A. Lavelli); 0000-0003-0870-0872 (R. Zanoli)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073
Table 1
Examples of Question, Answer and Answer with empathy


development in empathetic chatbot technology, see Table 1 for an example. The paper also elaborates on the methodology employed for dataset generation, highlighting its applicability to diverse domains and languages.

2. Related Works
    The development of empathetic chatbots capable of understanding and responding to human emotions represents a research area of growing interest [5]. However, building such systems requires high-quality datasets that include examples of human-machine interactions with empathic components.
    Despite the growing availability of datasets for machine learning and natural language processing, the lack of resources dedicated specifically to empathetic Italian-language chatbots represents a significant challenge.
    There are datasets that contain emotional information, such as [6] [7] [8] [9]. However, these resources focus primarily on labelling words or sentences with generic emotions and do not provide the context for complex, nuanced conversational interactions like those required for developing empathetic chatbots.

3. Dataset
    This section details the methodology employed for the construction of the IDRE dataset and outlines the evaluation process implemented.
    The dataset created consists of 480 sentences and roughly 18k total tokens divided as follows: 2k for the question, 7k for the bot's response and 9k for the response with empathy.

3.1. Dataset Creation
    The IDRE dataset comprises triplets of sentences: the first sentence represents a user query, the second sentence is the corresponding response generated by a chatbot, and the third sentence is a transformed version of the second sentence intended to enhance its empathetic tone.
    The sentences were generated by the Llama2 13B language model [11], operating on an Azure Virtual Machine equipped with four NVIDIA Tesla V100 GPUs. The choice of Llama 2 was motivated by its open-source nature, which allowed flexible and provider-independent access.
    The dataset generation process consists of two phases, as illustrated in Figure 1:
    QnA Sentence Generation: To ensure the generation of empathetic and compassionate responses, the healthcare domain was selected as the focus for the initial set of bot-human sentence pairs. This domain, characterized by sensitive topics, is well-suited for evaluating the model’s ability to generate empathetic responses.
    The thirteen specific topics chosen for the sentence pairs were invented for the purposes of the experiment: 'information on breast cancer', 'breast cancer prevention', 'therapies for breast cancer', 'psychological support after a cancer diagnosis', 'life expectancy after a cancer diagnosis', 'psychological support after surgery', 'hospital admissions', 'post-operative care', 'information on leukemia', 'psychological support', 'anti-cancer therapies', 'information on stroke', and 'preparation for surgeries.'
    An initial set of bot-human sentence pairs was generated using the Llama2 model. These pairs simulated a typical chatbot interaction concerning a specific health issue or domain. For instance, a human query such as "What are the symptoms of COVID-19?" would elicit a corresponding chatbot response like "The most common symptoms of COVID-19 are fever, dry cough, and tiredness".
    Empathy Enhancement: After the generation of the initial sentence pairs, an empathy enhancement process was undertaken. Leveraging the Llama2 model once more, the chatbot responses were modified to convey a more empathetic tone. This was achieved by prepending expressions of concern or appreciation, and by substituting specific words to engender a supportive demeanor. To illustrate, the aforementioned chatbot response could be transformed into "I understand that you’re concerned about COVID-19. Some common symptoms include fever, dry cough, and fatigue".
    Both prompts are included in the Appendix.

Figure 1: Dataset generation process

3.2. Evaluation Methodology
    To ensure the quality of the generated sentences, a rigorous evaluation process was implemented. Twelve volunteer annotators from Lutech-Softjam, experienced IT developers and project managers with a solid understanding of the chatbot domain, participated. Despite lacking prior experience in linguistic annotation, their familiarity with chatbots significantly accelerated the evaluation process. Before starting, they underwent comprehensive training on the evaluation task.
    Each evaluator was assigned 70 sentences for assessment. To ensure diverse evaluations, 40 sentences were unique to each evaluator and used for dataset creation, while 30 common sentences were evaluated by all evaluators solely to measure agreement; these common sentences are not part of the dataset. This approach ensured that each sentence received focused evaluation while also providing a consistent assessment across evaluators.
    The evaluation process involves the administration of a metric-specific question, which requires a response on a scale of 1 to 5.
    The rating scale used is the following:

    1.   Totally disagree
    2.   Disagree
    3.   Neutral
    4.   Agree
    5.   Totally agree

    The specific metrics used in this evaluation are:

    •    Bot sentence correctness: measures the absence of spelling, grammatical, or punctuation errors in the question and the bot’s answer. The question used is: “Il testo della risposta con empatia è corretto sia dal punto di vista grammaticale che semantico.”
    •    Absence of English words in bot sentences: checks if there are any words or sentences in English within the sentences generated by the model. The question used is: “Nel testo della domanda dell’utente e della risposta del bot (colonne QUESTION e ANSWER) non sono presenti parole o frasi in lingua inglese, a meno che non siano di uso comune in italiano (ad esempio “badge”, “sport”, ecc.)”
    •    Empathic answer correctness: measures the absence of spelling, grammatical, or punctuation errors in the bot’s answer with the insertion of empathy. The question used is: “Il testo della risposta con empatia è corretto sia dal punto di vista grammaticale che semantico.”
    •    Absence of English words in empathic sentences: checks if there are any words or sentences in English within the sentences with empathy generated by the model. The question used is: “Nel testo della risposta con empatia non sono presenti parole o frasi in lingua inglese, a meno che non siano di uso comune in italiano (ad esempio “badge”, “sport”, ecc.)”
    •    Semantic coherence: measures if the bot’s answer and the bot’s answer with empathy are semantically similar. The question used is: “La risposta con empatia ha lo stesso significato semantico della risposta del chatbot. Non ci sono concetti mancanti o contraddittori”
    •    Empathy increase: measures if the bot’s answer with empathy has an effective increase of empathy compared to the bot’s answer. The question used is: “La frase nella colonna ANSWER WITH EMPATHY esprime più empatia rispetto alla frase nella colonna ANSWER”
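The agreement measurement on the 30 shared sentences can be illustrated with a small, self-contained sketch of Fleiss' kappa, computed once on the raw 1 to 5 scores and once after collapsing the scores into three bands (1-2, 3, 4-5). This is only an illustration under stated assumptions: the function names and the toy ratings are hypothetical and are not taken from the IDRE annotations or from the authors' code.

```python
from collections import Counter


def fleiss_kappa(ratings, categories):
    """Fleiss' kappa for items that were each labelled by the same number of raters.

    ratings    -- list of items; each item is the list of labels its raters gave
    categories -- all labels that can occur
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    counts = [Counter(item) for item in ratings]

    # Observed agreement: share of agreeing rater pairs, averaged over items.
    per_item = [
        (sum(c[cat] ** 2 for cat in categories) - n_raters)
        / (n_raters * (n_raters - 1))
        for c in counts
    ]
    p_observed = sum(per_item) / n_items

    # Chance agreement, from the overall share of each category.
    total = n_items * n_raters
    p_chance = sum((sum(c[cat] for c in counts) / total) ** 2 for cat in categories)

    if p_chance == 1.0:
        raise ValueError("kappa is undefined when only one category is ever used")
    return (p_observed - p_chance) / (1.0 - p_chance)


def to_band(score):
    """Collapse a 1-5 score into 'low' (1-2), 'neutral' (3) or 'high' (4-5)."""
    if score <= 2:
        return "low"
    return "neutral" if score == 3 else "high"


# Made-up ratings: 3 items, 3 raters each (NOT taken from IDRE).
toy = [[4, 5, 4], [1, 2, 1], [3, 3, 3]]
raw_kappa = fleiss_kappa(toy, categories=[1, 2, 3, 4, 5])
banded = [[to_band(s) for s in item] for item in toy]
banded_kappa = fleiss_kappa(banded, categories=["low", "neutral", "high"])
print(f"raw: {raw_kappa:.3f}  aggregated: {banded_kappa:.3f}")
```

In the toy data the raters differ only between adjacent scores (4 vs 5, 1 vs 2), so collapsing the scale removes those disagreements and the aggregated coefficient is higher than the raw one.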
4. Dataset Analysis
    This section analyses data quality by examining both the distribution of agreement scores and the level of inter-annotator agreement (IAA).
    Due to a limited pool of available evaluators, the dataset was constrained to 480 annotated sentences. These sentences were evenly distributed among 12 volunteers, each assessing 40 sentences (excluding the 30 sentences used for measuring agreement). This approach was chosen to ensure the quality of the annotations while preventing evaluator fatigue. A more in-depth analysis reveals that 223 sentences, equal to 46.5% of the total, have a score greater than or equal to 3 on all the metrics considered. This means that these sentences were judged to be of high quality in every aspect analysed. This subset of data can be used to fine-tune language models.
    To obtain an analysis that is more robust and less subject to small variations, the annotation categories were grouped into three macro-categories: scores 1 and 2, score 3 (neutral) and scores 4 and 5.
    The analysis of sentences with lower scores (1 and 2) revealed three key factors: grammatical errors, the presence of non-Italian words and lack of a significant increase in empathy, as shown in Figure 2.
    Grammatical Errors: A substantial portion of sentences with lower scores exhibited grammatical errors (words in red). This highlights the importance of incorporating robust grammar checks during the generation process. Example: “Ohimini, cara/o utente, è comprensibile che durante il trattamento del tumore possa esserti difficile gestire i sintomi. Sono qui per aiutarti a trovare soluzioni e supporti per farcela insieme”. “Ohimini” is a made-up word and “supporti” contains a typo.
    Non-Italian Words: The lower-score sentences frequently included non-Italian words (words in red), primarily English. This deviation from the dataset’s focus on Italian-language interactions can be attributed to the underlying multilingual language model, which was predominantly trained on English text. This highlights the need for improved language model training to prioritize Italian vocabulary. Example: “Per prevenire le infezioni after surgery, è importante seguire le istruzioni del medico e del personale ospedaliero, come ad esempio lavare le mani frequentemente, evitare di toccare la ferita e utilizzare dispositivi di protezione individuali.”
    Lack of a significant increase in empathy: Among the lower-score sentences (173, representing 36%), the transformed responses (indicated by the blue and orange columns) did not exhibit a significant rise in empathy or indecision compared to the original chatbot responses. This suggests that further refinement of the empathy-enhancing techniques might be necessary.

Figure 2: Scores Distribution for all metrics.

    For the analysis of the inter-annotator agreement (IAA) of the annotations produced as outlined in Section 3.2, Fleiss’ kappa coefficient was employed to quantify the level of concordance between multiple annotators while accounting for potential chance agreement. Kappa values range from -1 to 1, with negative values indicating agreement below chance, values between 0 and 0.2 representing slight agreement, 0.21 to 0.4 fair agreement, 0.41 to 0.6 moderate agreement, 0.61 to 0.8 substantial agreement, and values exceeding 0.8 denoting almost perfect agreement.
    Calculating the kappa coefficients on the aggregated categories made it possible to evaluate the inter-annotator agreement in a more robust way. The results are summarized in Table 2. Notably, the highest levels of agreement were observed for metrics related to the presence of English words. This finding is likely attributable to the relative simplicity of this specific annotation task. Conversely, metrics assessing other linguistic features exhibited lower, yet still acceptable, levels of agreement, falling in the moderate-to-substantial range before aggregation (0.566 to 0.645).

Table 2
Agreement result

Metrics                                          Fleiss Kappa   Aggregate Fleiss Kappa
Bot sentence correctness                         0.608          0.821
Absence of English words in bot sentences        0.781          0.927
Empathic answer correctness                      0.566          0.807
Absence of English words in empathic sentences   0.782          0.948
Semantic coherence                               0.587          0.881
Empathy increase                                 0.645          0.840

    Figure 3 presents the distribution of annotations for three metrics: “Empathy increase”, “Bot sentence correctness”, and “Absence of English words in bot sentences”. The distribution for “Absence of English words in bot sentences” exhibits a marked concentration towards the highest score (5), indicating a strong consensus among annotators regarding the absence of English words in bot sentences. In contrast, the distributions for “Empathy increase” and “Bot sentence correctness” are more dispersed across the entire range of possible scores, suggesting a greater degree of variability in annotator assessments on these metrics.
    The observed disparity in distribution patterns between the metrics can be attributed to the inherent nature of the annotation tasks. The task of identifying the absence of English words in bot sentences is relatively straightforward and objective, leading to a higher degree of agreement among annotators. On the other hand, assessing bot empathy increase involves a more subjective judgment of factors such as grammatical accuracy, coherence, and relevance, resulting in a wider range of annotations.
    The same behaviour can be noticed with the metric “Empathic answer correctness”.

Figure 3: Distribution of “Empathy increase”, “Bot sentence correctness”, and “Absence of English words in bot sentences”

5. Discussion and Conclusion
    In this work, we have presented the creation of a dataset of sentences representing typical interactions with a healthcare chatbot. The dataset includes both user input sentences and empathetic responses generated by the chatbot. Human validation has confirmed the quality and usefulness of the dataset for developing and evaluating empathetic chatbots in the healthcare domain.
    This work presents a two-pronged contribution to the field of empathetic chatbots, specifically focusing on the Italian language.
    Firstly, it addresses the critical issue of data scarcity by providing a high-quality, annotated dataset for training and evaluating empathetic chatbots within a healthcare context. This dataset can be employed to fine-tune large language models (LLMs) such as Llama2, enabling them to generate responses with demonstrably enhanced empathetic qualities. The limitations of non-fine-tuned models are exemplified through the observation that they can produce factually incorrect or unempathetic sentences (e.g., "Il tuo corpo è vulnerabile al rischio del tumore al seno a causa della tua età avanzata, nonostante la tua vitalità e forza interiori. La storia familiare di tumori al seno nella tua famiglia e la tua condizione di obesità possono aumentare il rischio, come pure l'abuso di tabacco e alcool. Inoltre, la tua scelta di non avere figli o di averli dopo l'età di 35 anni può aggiungere ulteriore rischio al tuo corpo."). By leveraging the proposed dataset and selecting sentences with demonstrably high empathy scores, a targeted training set can be constructed specifically for this purpose. This, in turn, allows for the fine-tuning of the LLM, significantly improving its ability to generate empathetic responses in a healthcare setting.
    Secondly, the work contributes a rigorous human validation methodology for evaluating the effectiveness of empathy expression in chatbots. This methodology provides a valuable tool for researchers and developers working in this domain.

5.1. Future Work
    In the future, we intend to expand the work in two main directions:
    Domain expansion: We will explore the creation of similar datasets for other domains, such as customer service or education, to assess the applicability of our approach in different contexts.
    Comparison of language models: We will conduct a comparative study to evaluate the performance of different language models in generating empathetic chatbot responses. This study will allow us to identify the most suitable language model for this specific task.
    We believe that this work represents an important step towards the development of empathetic chatbots capable of offering a more natural and engaging user experience, especially in sensitive contexts such as healthcare.
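As one possible reading of the selection step mentioned in this section (keeping only well-rated sentences to build a fine-tuning set), the sketch below filters annotated rows by score. It is a minimal illustration, not the authors' procedure: the metric column names, the `select_subset` helper and the toy rows are all hypothetical, the threshold of 3 on every metric mirrors the criterion reported in Section 4, and the optional stricter threshold on the empathy metric is an assumption.

```python
# Hypothetical column names for the six metrics of Section 3.2.
METRICS = (
    "bot_correctness",
    "bot_no_english",
    "empathic_correctness",
    "empathic_no_english",
    "semantic_coherence",
    "empathy_increase",
)


def select_subset(rows, min_score=3, min_empathy=None):
    """Keep rows that score at least min_score on every metric.

    If min_empathy is given, the 'empathy_increase' score must also reach it.
    """
    kept = []
    for row in rows:
        if any(row[m] < min_score for m in METRICS):
            continue
        if min_empathy is not None and row["empathy_increase"] < min_empathy:
            continue
        kept.append(row)
    return kept


# Made-up rows (NOT taken from IDRE).
rows = [
    dict.fromkeys(METRICS, 5),
    {**dict.fromkeys(METRICS, 4), "empathy_increase": 3},
    {**dict.fromkeys(METRICS, 4), "bot_no_english": 2},
]
all_ok = select_subset(rows)                     # score >= 3 on every metric
empathetic = select_subset(rows, min_empathy=4)  # additionally a clear empathy gain
print(len(all_ok), len(empathetic))

# Share reported in Section 4: 223 of the 480 sentences.
share = round(100 * 223 / 480, 1)
print(f"{share}%")
```

A stricter `min_empathy` trades dataset size for a clearer empathy signal; which threshold is appropriate for fine-tuning is not stated in the paper.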
Acknowledgements

    The authors would like to express their sincere gratitude to Lutech-Softjam colleagues for their invaluable contributions to the evaluation of the dataset used in this study. Their expertise and meticulous work in assessing the dataset’s quality and relevance were instrumental in ensuring the robustness of our research findings.

References

    [1] J.-M. Fellous and M. A. Arbib, "Who needs emotions?: The brain meets the robot," Oxford University Press, 2005.

    [2] E. Zaranis, G. Paraskevopoulos, A. Katsamanis and A. Potamianos, "EmpBot: A T5-based Empathetic Chatbot focusing on Sentiments," arXiv preprint arXiv:2111.00310, 2021.

    [3] J. Shin, P. Xu, A. Madotto and P. Fung, "Generating empathetic responses by looking ahead the user’s sentiment," in ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020.

    [4] F. Liu, Q. Mao, L. Wang, N. Ruwa, J. Gou and Y. Zhan, "An emotion-based responding model for natural language conversation," Springer Science+Business Media, 2018.

    [5] Q. Guo, Z. Zhu, Q. Lu, D. Zhang and W. Wu, "A Dynamic Emotional Session Generation Model Based on Seq2Seq and a Dictionary-Based Attention Mechanism," Appl. Sci., p. 10, 2020.

    [6] R. Sprugnoli, "MultiEmotions-It: a New Dataset for Opinion Polarity and Emotion," in Proceedings of the Seventh Italian Conference on Computational Linguistics, 2020.

    [7] S. M. Mohammad, "Practical and ethical considerations in the effective use of emotion and sentiment lexicons," arXiv preprint arXiv, 2020.

    [8] A. Welivita, Y. Xie and P. Pu, "A Large-Scale Dataset for Empathetic Response Generation," Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 1251-1264, 2021.

    [9] H. Rashkin, E. M. Smith, M. Li and Y.-L. Boureau, "Towards empathetic open-domain conversation models: A new benchmark and dataset," arXiv preprint arXiv:1811.00207, 2018.

    [10] A. Welivita, Y. Xie and P. Pu, "A Large-Scale Dataset for Empathetic Response Generation," Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 1251-1264, 2021.

    [11] H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale and others, "Llama 2: Open foundation and fine-tuned chat models," arXiv preprint arXiv:2307.09288, 2023.

    [12] S. C. Gadanho, "Learning Behavior-Selection by Emotions and Cognition in a Multi-Goal Robot Task," Journal of Machine Learning Research, vol. 1, pp. 385-412, 2003.

A. Online Resource
The dataset can be downloaded at https://github.com/smanai/idre

B. Appendix
     The prompts used for both steps of dataset creation are shown below.
     Prompt for QnA Sentence Generation: """genera {} coppie di domande utente e risposta di un assistente virtuale.
     Le domande devono essere in lingua italiana e rappresentano frasi tipiche una persona che vuole informazioni nel dominio "{}".
     Le risposte sono quelle di un tipico chatbot di un call center di un'azienda ospedaliera.
     Le risposte devono solo esporre dei fatti oggettivi e scientifici ma prive di empatia.
     la struttura del output deve essere:
     #
     utente:
     assistente:"""
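To make the use of the QnA prompt concrete, the following sketch fills its two `{}` placeholders (number of pairs and topic) and parses a reply that follows the requested `#` / `utente:` / `assistente:` structure. It is an illustration only: the helper names `build_qna_prompt` and `parse_pairs` are hypothetical, the sample reply is made up, and the sketch assumes that each turn fits on one line; the actual model call and post-processing used by the authors are not described in the paper.

```python
import re

# QnA prompt as printed in this appendix; the two "{}" slots are the
# number of pairs and the topic.
QNA_PROMPT = """genera {} coppie di domande utente e risposta di un assistente virtuale.
Le domande devono essere in lingua italiana e rappresentano frasi tipiche una persona che vuole informazioni nel dominio "{}".
Le risposte sono quelle di un tipico chatbot di un call center di un'azienda ospedaliera.
Le risposte devono solo esporre dei fatti oggettivi e scientifici ma prive di empatia.
la struttura del output deve essere:
#
utente:
assistente:"""


def build_qna_prompt(n_pairs, topic):
    """Fill the prompt with the number of pairs and the healthcare topic."""
    return QNA_PROMPT.format(n_pairs, topic)


def parse_pairs(reply):
    """Split a reply on '#' and extract (question, answer) pairs.

    Assumes every 'utente:' / 'assistente:' turn sits on a single line;
    blocks that lack either turn are skipped.
    """
    pairs = []
    for block in reply.split("#"):
        user = re.search(r"utente:[ \t]*(.+)", block)
        bot = re.search(r"assistente:[ \t]*(.+)", block)
        if user and bot:
            pairs.append((user.group(1).strip(), bot.group(1).strip()))
    return pairs


prompt = build_qna_prompt(2, "information on stroke")

# Made-up reply in the requested format (NOT real model output).
sample_reply = (
    "#\n"
    "utente: Quali sono i sintomi dell'ictus?\n"
    "assistente: I sintomi includono debolezza improvvisa e difficoltà nel parlare.\n"
    "#\n"
    "utente: Come si previene l'ictus?\n"
    "assistente: La prevenzione include il controllo della pressione arteriosa.\n"
)
pairs = parse_pairs(sample_reply)
print(len(pairs), pairs[0][0])
```

Skipping malformed blocks rather than raising is a design choice for this sketch; a stricter pipeline might instead log or regenerate replies that do not follow the structure.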
     Prompt for Empathy Enhancement: """La
seguente frase è la risposta di un chatbot di un call center
di un ospedale ad una persona che richiede informazioni.
La frase è informativa, ma non trasmette empatia per la
situazione della persona che chiama. Puoi modificare la
seguente frase aggiungendo l'empatia mancante?
     Puoi modificare la frase aggiungendo testo o
modificandolo ma deve mantenere lo stesso significato
semantico.
     la frase modificata deve essere scritta in lingua
italiana.
     Non devi scrivere altro testo oltre alla frase
trasformata.
     inizia la modifica della frase con il carattere "-" come
in un elenco puntato.
     """