<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Can LLMs Help Recollect and Elaborate On Our Personal Experiences?</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Gabriel Roccabruna</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Olha Khomyn</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Michele Yin</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giuseppe Riccardi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>CLiC-it 2025: Eleventh Italian Conference on Computational Linguistics</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Signals and Interactive Systems Lab, Department of Information Engineering and Computer Science, University of Trento</institution>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Work done while he was working at the University of Trento</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>In the act of narration, speakers engage with others, communicate findings, and share personal facts and knowledge. This act involves recollecting and reasoning about thoughts and events. Individuals need to plan and organize events and associated emotions in a temporal and logical order. These recollection processes are cognitively demanding and emotion-laden. In this work, we investigate whether Large Language Models (LLMs) may help and support the process of personal narration, i.e. in elaborating on the unfolding events, participants, and emotions. For this, we test LLMs' abilities on a novel task called Automatic NarraTive Elicitation (ANTE). We have crowdsourced a corpus of elicitation responses in the Italian language using a pre-existing dataset of personal narratives. We used this dataset to evaluate a set of closed and open-source LLMs with automatic and human-evaluation metrics. The human evaluation results show that GPT-4 achieves performance similar to humans', while smaller open-source LLMs struggle with this task. We investigate whether fine-tuning smaller open-source LLMs improves performance by experimenting with mixing crowd-sourced and synthetic data.</p>
      </abstract>
      <kwd-group>
        <kwd>Personal Narrative</kwd>
        <kwd>Large Language Models</kwd>
        <kwd>Elicitation</kwd>
        <kwd>Emotions</kwd>
        <kwd>Conversational Agent</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        The act of narration manifests in written or spoken
conversations. It is generally used to communicate facts,
knowledge and personal events. This act involves
recollecting and reasoning about thoughts and events. Indeed,
the narrative has been widely used in journalism [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ],
education [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], and economics [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. In psychology, the analysis
of personal narratives is a research tool used in many
fields such as rehabilitation [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], managing psychosis [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ],
investigating language dysfunctions [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], and monitoring
the variation of the emotional state during psychotherapy
[
        <xref ref-type="bibr" rid="ref7 ref8">7, 8</xref>
        ]. A Personal Narrative (PN) is a series of unfolding
events recounting the social interactions, emotions,
experiences and others lived by the narrator [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. In this sense,
a PN is a way to observe the interpretation of the world
from the narrator’s perspective [
        <xref ref-type="bibr" rid="ref10 ref11">10, 11</xref>
        ].
      </p>
      <p>
        Currently, the collection of personal narratives is
mainly based on textual stimuli or interviews. In the
textual stimuli approach, the narrators recount or write
down, in complete isolation, an event [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] recalled through a
crafted eliciting prompt based on valence-charged words
(e.g. friendship or death) or questions [
        <xref ref-type="bibr" rid="ref13 ref14">13, 14</xref>
        ].
      </p>
      <p>[Figure 1: An example of the elicitation of a personal narrative. An eliciting prompt ("What did you do this weekend?") is followed by the narrator's turns (recounting an encounter with a bear while hiking with the dog Poppy) and by the agent's empathetic eliciting responses, e.g. "Oh, that sounds scary! How did you get around the bear to reach the cottage?" and "That was smart! Did Poppy look scared on the way to the cottage?".]</p>
      <p>
        However, the act of narration may be a cognitively demanding and
emotionally intense process, leading some individuals to
get stuck with the narration or to recount overgeneralized
memories, overlooking important details of the story
[
        <xref ref-type="bibr" rid="ref12 ref15">15, 12</xref>
        ]. While human-human conversation has been
shown to alleviate these issues [
        <xref ref-type="bibr" rid="ref16 ref17 ref18">16, 17, 18</xref>
        ], the potential
role of Large Language Models (LLMs) in supporting this
process remains underexplored. Indeed, the recently suggested
improvements in safety, biases and toxicity
[
        <xref ref-type="bibr" rid="ref19 ref20">19, 20</xref>
        ] and in natural language fluency [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ] make these
models suitable candidates for this task.
      </p>
      <p>
        To help narrators recollect and elaborate on personal
events, LLMs must understand the unfolding events,
participants, and emotions encompassed in the Personal
Narrative (PN). In this work, we investigate whether
LLMs have these capabilities by evaluating their
performance on a novel task called Automatic NarraTive
Elicitation (ANTE). In this task, to support the elaboration
of personal events, the model is tasked to generate
empathetic eliciting responses pointing to a specific aspect
of the recount. We crowdsource a corpus of more than
500 eliciting responses in the Italian language, starting
from a pre-existing dataset of PNs. On this, we evaluate
5 open and closed-source LLMs with in-context learning.
The human evaluation has shown that while GPT-4 [22]
achieves on-par performance with the human reference,
all the open-source models lag behind. As closed-source
LLMs may have privacy issues and may not be affordable
over the long run, we explore whether fine-tuning small
open-source LLMs can reduce the gap. For this, we
augment the training set with a partition generated by
GPT-4. We then experiment with different combinations of
partitions (crowd-sourced vs synthetic data) during
fine-tuning. The results show that fine-tuning with synthetic
data improves the performance of all models, closing the
gap with the human reference.
      </p>
      <p>Our contributions can be summarized as follows:</p>
      <sec id="sec-1-1">
        <title>We envision a hybrid methodology for eliciting Personal</title>
        <p>Narratives (PN), which joins the benefits of textual
stimuli and interview approaches. The elicitation, depicted in
Figure 1, starts with an eliciting prompt such as a crafted
textual stimulus. Then, once the narrator finishes the first
part of the recount an agent asks a follow-up response
• Definition of a novel LLM skill for supporting that helps continue the narration by elaborating on some
personal narrations; aspect of the story. These exchanges go on till a certain
• Proposed guidelines and procedure for collect- criterion is met, depending on the application (e.g. based
ing the Automatic Narrative Elicitation (ANTE) on the narrative length), or the narrator explicitly wants
corpus; to stop.</p>
        <p>Formally, a prompt  elicits the main event of
• Automatic and human evaluation of 5 LLMs fol- the PN. This is followed by a sequence  =
lowing in-context learning and fine-tuning strate- [(1, 1), ..., (, )], where  is a narrative turn
gies; at time  and  is the corresponding eliciting response.
• Human evaluation protocol with two task-  consists of feedback and an eliciting question. The
specific metrics for the ANTE task; feedback must show active listening and be aligned with
the expressed narrator’s emotions. Furthermore, the
eliciting response must focus on relevant events mentioned
in  (  1  ) without significantly altering the
lfow of the narration.</p>
        <p>The Automatic NarraTive Elicitation (ANTE) task is
defined as:</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2. Related Works</title>
      <sec id="sec-2-1">
        <title>Question Generation Question Generation (QG) is a natural language processing task in which a model is tasked to generate a question given a context and a target answer [23]. Automatic NarraTive Elicitation (ANTE) is</title>
        <p>Definition 3.1. Given the sequence [(1, 1), ..., ],
the model generates a  such that  elicits the narrator
to continue with the story by yielding a +1.</p>
      </sec>
      <sec id="sec-2-2">
        <title>This task implicitly requires an emotional and semantic understanding of the narrative. Furthermore, it implicitly requires the ability to select the events that might be valuable to support the continuation of the narration.</title>
      </sec>
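      <p>To make the turn structure concrete, the following is a minimal Python sketch of the elicitation loop in Definition 3.1; the Dialogue class and the model.generate interface are our illustrative assumptions, not part of the ANTE specification.</p>
      <preformat># Illustrative sketch of the ANTE interaction loop (Definition 3.1).
from dataclasses import dataclass, field

@dataclass
class Dialogue:
    prompt: str                                 # eliciting prompt p
    turns: list = field(default_factory=list)   # [(n_1, e_1), ..., (n_t, e_t)]

def elicit(model, dialogue, n_t):
    """Given [(n_1, e_1), ..., n_t], generate e_t so that the narrator
    is elicited to continue the story by yielding n_{t+1}."""
    history = dialogue.turns + [(n_t, None)]
    e_t = model.generate(dialogue.prompt, history)  # feedback + eliciting question
    dialogue.turns.append((n_t, e_t))
    return e_t</preformat>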
    </sec>
    <sec id="sec-3">
      <title>4. Data Collection</title>
      <sec id="sec-3-1">
        <title>We have experimented with 5 closed-sourced and open</title>
        <p>source LLMs, namely GPT-4, Llama3 8B [39], Vicuna
13B [40], LLaMAntino 13B [41], and IT5 [42]. The</p>
      </sec>
      <sec id="sec-3-2">
        <title>Similarly, we have included the description of undesir</title>
        <p>able properties, such as asking for personal opinions, and
hypothetical events, giving suggestions or shifting the
1https://github.com/sislab-unitn/ANTE
focus of the conversation away from the narrated event. 2https://www.prolific.com/
Furthermore, to help the annotator focus the question 3We used gpt-4-turbo</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>6. Evaluation</title>
      <sec id="sec-4-1">
        <title>6.1. Metrics</title>
        <p>
          selection of the models has only considered LLMs
supporting the Italian language i.e. the language of the ANTE
dataset. IT5 is pre-trained on the Italian dataset, while
LLaMAntino 13B based on Llama2 [
          <xref ref-type="bibr" rid="ref25">33</xref>
          ] is fine-tuned on
the Italian language using LoRa [43]. Instead, Llama3
8B and Vicuna 13B are pre-trained on a multi-lingual
dataset.
        </p>
        <sec id="sec-4-1-1">
          <title>We have evaluated the models on the ANTE task both</title>
          <p>
            with automatic and human evaluation metrics. We have
used the automatic metric to have a proxy for
performance estimates during the development of the models,
i.e. before the resource-demanding human evaluation. As
5.1. In-Context Learning an automatic evaluation metric, we have used the BLEU
In-context learning, or few-shot learning, is a technique 1 score [44]. Regarding the human evaluation, we have
in which the model can learn from a few examples pro- adopted a human evaluation protocol developed for
evalvided in the context [
            <xref ref-type="bibr" rid="ref23">31</xref>
            ]. In our case, five pairs (5-shot) uating dialogue models in a reproducible and comparable
of narratives and corresponding eliciting responses are way [45]. From this, we have used the Appropriateness,
given to the model. In particular, we have used the same Contextualization and Correctness metrics5. Each metric
examples written in the guidelines for collecting the is translated into a question to which the annotators can
dataset. answer Yes, No, or I don’t know. Furthermore, the
annotaThe input to the model is formalized as: tors can provide explanations for a negative answer for
some metrics. For contextualization, the annotators can
 ⊕ { 11, 11 ⊕ ... ⊕ 15, 15} ⊕  justify their negative answer with wrong or no references
to the grounding context representing hallucination and
where I are the instructions for the model, ⊕ is the genericness, respectively.
concatenation with the new line (\n),  1, 1 are i-shot While the proposed metrics are enough for evaluating
example of the narrative and the corresponding eliciting generic dialogue models, we need specific criteria for
response at the first turn of the dialogue, N is the input better evaluating the models on our task. Specifically,
narrative that the model should generate the response to. we introduced Efectiveness and Compliance. Efectiveness
The beginning of the narrative and the response are indi- evaluates whether the response is efective in helping the
cated with two marker tokens, namely “Narrative:” and narrator continue with the narration naturally. The two
“Response:” 4. We have also experimented with adding possible explanations for being an inefective response
the annotation guidelines before the instructions for the are that the question is either generic (generic question) or
model, but observed only an increase in inference time complex (complex question), which means the narrators
and not in performance. will have dificulties in answering that question.
Diferent from the genericness in contextualization, a generic
5.2. Fine-tuning response can still be efective when the context is not
enough for asking a more specific question. Compliance
In training, the input sequences consist of a narrative evaluates whether the response is compliant with the
and the corresponding eliciting response, concatenated annotation guidelines, i.e. it has the properties listed in
with the new line (\n). Additionally, we add two marker Section 4.
tokens to the input prompt to indicate the beginning of Additionally, in the HE, we have added ground truth
the narrative and the response, respectively. eliciting responses along with those generated as a point
Formally, the input sequence is: of reference and an additional control step [45].
Moreover, as for the data collection, we have split the
evalua  :  ⊕  :  tions into batches of five narratives. Each batch has been
annotated by five crowd workers hired via Prolific and
where N is the narrative, ⊕ is the concatenation with paid £9 per hour. Furthermore, we used an overlap of 20%
the new line and R is the corresponding eliciting response. to compute the agreement, whose overall score is 0.34
In fine-tuning the open-source LLMs, the input of the measured with Fleiss’ [
            <xref ref-type="bibr" rid="ref29">46</xref>
            ], showing a fair agreement.
autoregressive models is as described above, while for
the sequence-to-sequence IT5 model, the input to the
encoder and decoder is narrative and eliciting response,
respectively. All the hyperparameters used to fine-tune
and test the models are reported in Appendix A.
          </p>
        </sec>
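        <p>As a worked example, the two input formats above can be sketched as follows; the instruction text and the examples are placeholders, not the actual collection guidelines.</p>
        <preformat># Sketch of the 5-shot ICL prompt and of the fine-tuning sequence.
SEP = "\n"  # the concatenation operator is a new line

def icl_prompt(instructions, shots, narrative):
    """I, then n_1^1, e_1^1, ..., n_1^5, e_1^5, then N, with marker tokens."""
    parts = [instructions]
    for n_i, e_i in shots:  # five (narrative, eliciting response) pairs
        parts += ["Narrative: " + n_i, "Response: " + e_i]
    parts += ["Narrative: " + narrative, "Response:"]  # model continues here
    return SEP.join(parts)

def finetuning_sequence(narrative, response):
    """Training input: Narrative: N, new line, Response: R."""
    return "Narrative: " + narrative + SEP + "Response: " + response</preformat>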
        <sec id="sec-4-1-2">
          <title>4An example of a real prompt is reported in Appendix A in Table 5.</title>
        </sec>
        <sec id="sec-4-1-3">
          <title>5Appropriateness whether the response makes sense w.r.t the dia</title>
          <p>logue history; Contextualization whether the response contains
references to the dialogue context; Correct whether the response is
grammatically and syntactically correct.
6.2. Automatic Evaluation thermore, while fine-tuning the models on the merged
and synthetic datasets always degrades the performance
Table 2 reports the BLEU 1 score for each model at- measured on the gold test set, it generally increases the
tained with in-context learning and fine-tuning on crowd- scores on the silver test set. Finally, Llama3 8B fine-tuned
sourced, merged and synthetic datasets. As ground truth, on the synthetic dataset achieves the best BLEU score on
we use both gold and silver eliciting responses coming the silver test set.
from the crowdsourced and synthetic test sets, respec- According to these results, Llama3 8B and IT5 should
tively. have similar performance on the ANTE task.
Notwith</p>
          <p>
            From the results of the in-context learning experi- standing, recent studies have shown that automatic
ments, we observe that GPT-4 outperforms all the other metrics are poorly correlated with human judgement
models by efectively leveraging the provided examples [
            <xref ref-type="bibr" rid="ref30 ref31">47, 48, 45</xref>
            ]. For this reason, we have used human
evaluawith few shots. Fine-tuned on the crowdsourced dataset, tion to have a more realistic representation of the LLMs’
Vicuna 13B and IT5 outperform GPT-4 with ICL, achiev- performance.
ing the highest results on the gold test set overall.
Fur
          </p>
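        <p>For reference, BLEU-1 can be computed as below; NLTK is our choice of implementation, as the paper does not name one.</p>
        <preformat># BLEU-1 between a generated eliciting response and the gold reference.
from nltk.translate.bleu_score import sentence_bleu

def bleu1(reference, hypothesis):
    # weights=(1, 0, 0, 0) keeps only unigram precision, i.e. BLEU-1
    return sentence_bleu([reference.split()], hypothesis.split(),
                         weights=(1.0, 0.0, 0.0, 0.0))</preformat>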
        <p>[Table 3: Human evaluation results (Appropriateness, Contextualization, Correctness, Compliance, Effectiveness) for ICL and for fine-tuning on the crowdsourced, merged, and synthetic datasets.]</p>
      </sec>
      <sec id="sec-4-2">
        <title>6.3. Human Evaluation</title>
        <sec id="sec-4-2-1">
          <title>The results of the human evaluation are presented in</title>
          <p>Table 3. Similarly to the automatic evaluation, the
table shows the results achieved with ICL and fine-tuning
on crowdsourced, merged and synthetic datasets. The
values represent the percentage of eliciting responses
that received a positive evaluation for the corresponding
metric. Considering the limited size of the test set (57
examples) and the unavoidable subjectivity and ambiguity
in the evaluation process, the results are compared with a
coarse margin that we empirically set to ± 5. Along with
manual inspection, this is also supported by the
percentage of “I don’t know” options, catching the ambiguous
cases, which ranges from 3.5% for human reference to
9.1% for Vicuna 13B on average.</p>
          <p>The results in the ICL setting show that the ANTE task
is challenging also for crowd workers (human reference)
who in some cases could not refrain from giving
suggestions or asking for personal information (e.g. What’s
the name of your kid?). Moreover, GPT-4 achieves
on-par performance with human annotators on all metrics
but compliance since the model gave suggestions
similar to the human reference. Given the overall positive
scores, we have used GPT-4 to generate the synthetic
data. Regarding the other models, the gap with human
reference is overall large. Only LLaMAntino 13B and
Vicuna 13B achieve decent performance on the two
task-specific metrics, compliance and effectiveness. Moreover,
the scores on correctness suggest that only LLaMAntino
13B and GPT-4 can properly handle the Italian language
in this task without fine-tuning.</p>
        <p>Fine-tuning especially boosts the performance of
IT5 and Llama3 8B, while more contained improvements
are observed for LLaMAntino 13B and Vicuna 13B.
Moreover, LLaMAntino 13B and Llama3 8B achieve their best
results when fine-tuned on the synthetic dataset, whilst
IT5 and Vicuna 13B perform best when fine-tuned
on the merged dataset. In particular, Llama3 8B
fine-tuned on the synthetic dataset attains an improvement of
35% on average w.r.t. the ICL results, outperforming all the
other open-source LLMs and matching the performance
of human annotators and GPT-4 on the task-specific metrics.
Despite a lower performance gain (10% on
average), LLaMAntino 13B is the second-best model on the
ANTE task, matching GPT-4's performance on effectiveness
and correctness. Regarding the correctness metric, we can
observe that IT5 always achieves the lowest score, except
on the merged dataset, despite being pre-trained on a
corpus in the Italian language.</p>
        <p>All in all, fine-tuning with synthetic data (either
the merged or the synthetic dataset) improves the performance
of almost all the models. Indeed, the scores on the
task-specific metrics achieved by fine-tuning the models on
the crowdsourced dataset are lower on average than
those achieved with the merged and synthetic datasets. A
possible explanation for these improvements is that the
merged dataset is larger; therefore, a small model such
as IT5 (220M parameters) benefits from this.</p>
      </sec>
      <sec id="sec-4-3">
        <title>6.4. Error Analysis</title>
        <sec id="sec-4-3-1">
          <title>Since the human evaluation has shown that GPT-4 matches the Human Reference’s (HR) performance, we have run some analysis to characterize the similarities and diferences better. We have started by manually com</title>
          <p>
            paring the eliciting responses of GPT-4 and HR. In this, that the cases of hallucination and genericness on the
we observed that GPT-4 tends to use paraphrased parts synthetic dataset are minimized compared to fine-tuning
of the narrative in the feedback and question parts of on the crowdsourced dataset. The improvement is even
the eliciting response. Indeed, the Jaccard similarity [
            <xref ref-type="bibr" rid="ref32">49</xref>
            ] more evident comparing the errors of IT5 fine-tuned on
between the narrative and the eliciting response6 on av- crowdsourced and merged datasets, where the number of
erage is 13% for GPT-4 and 7% for HR. After that, we generic questions is halved, and the hallucination cases
investigate whether there is a challenging set of exam- decrease by 11%. All in all, we can observe that the major
ples on which both models make errors by considering source of errors for contextualization and efectiveness is
an eliciting response wrong when it received negative due to either hallucination or genericness, regardless of
feedback on at least one metric. The intersection of the the dataset used during fine-tuning.
errors is only the 7% of the narrative, while the cases We have investigated whether the performance gap
in which HR is correct and GPT-4 is wrong are 20% and between fine-tuning on crowdsourced and synthetic
vice versa are 13%. By analysing all these errors man- datasets is due to a diference in the learning
complexually, we observed that in some cases GPT-4 deducted ity. In other words, learning from synthetic data may be
the context wrongly such as “I was having a cofee with easier than learning from human-generated data. Our
a colleague and we were talking about Christmas when...” rationale is that the distribution learned by LLMs, during
and the model asked7 “Have you already decided what to pre-training, is more similar to the distribution of
syngift for Christmas?”. Overall, one of the main issues is thetic data than that of human-generated data. This is
due to suggestions or requests for personal information because LLMs are based on similar architectures, and the
negatively afecting the performance on appropriateness relative pre-training datasets may overlap. For this, we
and compliance. have used the entrainment statistic because of the
dif
          </p>
          <p>
            The distributions of the explanations that annotators ferent vocabularies, making measuring the distribution
gave to justify their negative evaluations for the metrics distance challenging. Entrainment is the phenomenon
contextualization (wrong and no references) and efective- in which, during a conversation, a speaker reuses the
ness (complex or generic questions) are depicted in Figure terms of the other interlocutor [
            <xref ref-type="bibr" rid="ref33">50</xref>
            ]. This phenomenon
2. HR and GPT-4 errors are reported as references in may also be seen during the training process, where a
all plots. We can observe that HR is penalized on con- model learns to use the same language as the training set.
textualization and efectiveness due to genericness in the We have measured the entrainment using the formula
responses. On the GPT-4 side, the negative score on efec- proposed by Hirschberg et al. [
            <xref ref-type="bibr" rid="ref34">51</xref>
            ], which is:
tiveness is mainly due to complex questions. Furthermore,
the percentage of errors classified as wrong references is   () = − ∑∑︀︀∈∈ ||11 (())−+22 (())|| (1)
zero for both HR and GPT-4, meaning that GPT-4 does not
hallucinate in this task. The opposite is observed in the
ICL experiments where Llama3 8B has been penalized on
contextualization mainly due to wrong references, i.e., the
model hallucinated some part of the eliciting response.
          </p>
          <p>Moreover, for the same model, the efectiveness score is
negatively afected by many generic questions. As for
human evaluation, the distributions of the errors show
that fine-tuning the models improves the performance,
especially with synthetic data. In this, we can observe
where  is a target word class and  is the
frequency of the word  used by the model 1 and the test
set responses 2. The resulting score ranges between
0 (perfect match) and -1 (mismatch). We used the 100
most frequent words computed on the joint responses
generated by 1 and 2.</p>
          <p>Specifically, as 1, we have used the responses
generated by either Llama3 8B or LLaMAntino 13B8 fine-tuned
on crowdsourced (FTC) and synthetic (FTS) datasets. As
2, we have used the responses either in the
crowdsourced (CT) or the synthetic (ST) test sets. From Table</p>
        </sec>
        <sec id="sec-4-3-2">
          <title>6From both, we removed the stopwords and lemmatized the rest.</title>
          <p>7In this case, the model wrongly inferred that Christmas is yet to
come, which is impossible to say by looking at the context only.
The model should have focused on other parts of the narrative.</p>
        </sec>
        <sec id="sec-4-3-3">
          <title>8The two best-performing models.</title>
        </sec>
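        <p>A minimal sketch of the entrainment computation in Equation (1) follows; whitespace tokenization and the normalization of counts into relative frequencies are our assumptions.</p>
        <preformat># Entrainment (Eq. 1) over the 100 most frequent words of the joint responses.
from collections import Counter

def entrainment(responses_1, responses_2, k=100):
    tokens_1 = [w for r in responses_1 for w in r.split()]
    tokens_2 = [w for r in responses_2 for w in r.split()]
    f1, f2 = Counter(tokens_1), Counter(tokens_2)
    n1, n2 = sum(f1.values()), sum(f2.values())
    # target word class c: the k most frequent words in the joint responses
    c = [w for w, _ in Counter(tokens_1 + tokens_2).most_common(k)]
    # 0 means a perfect frequency match; more negative means mismatch
    return -sum(abs(f1[w] / n1 - f2[w] / n2) / (f1[w] / n1 + f2[w] / n2)
                for w in c)</preformat>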
        <sec id="sec-4-3-4">
          <title>To test the models in a real-case scenario, we have de</title>
          <p>
            veloped a Virtual Reality (VR) system for the collection
of personal narratives. The collection follows the same
procedure as depicted in Figure 1, which starts with an
eliciting prompt and is followed by a conversation
between a narrator and an embodied conversational agent.
The system consists of an automatic speech recognition
[
            <xref ref-type="bibr" rid="ref35">52</xref>
            ] model, a conversational agent based on our
bestperforming LLM (Llama3 8B), which generates eliciting
responses, and a text-to-speech model. To connect these
components, we have utilized an adaptation of the
architecture proposed by Yin et al. [
            <xref ref-type="bibr" rid="ref36">53</xref>
            ], which also employs
a strategy of input segmentation to minimize response
latency. After some internal tests, we have observed that
the dialogue is efective and the system’s response latency
is not a major issue. However, the turn-taking strategy
is rule-based and, therefore, studying a more efective
approach would make the conversation smoother9.
          </p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>8. Conclusions</title>
      <sec id="sec-5-1">
        <title>In this work, we evaluated 5 LLMs on the Automatic</title>
        <p>NarraTive Elicitation (ANTE) task to investigate whether
the models can help us elaborate and recollect personal
events. To do this, we collected and created three corpora,
namely crowdsourced, merged, and synthetic. Then, we
evaluated closed and open-source models with in-context
learning and fine-tuning on the ANTE task. The results
show that closed-source LLMs can perform similarly to
human annotators and that fine-tuned open-source LLMs
on synthetic data can achieve similar performance. This
suggests that LLMs may be used to support individuals
in recollecting and elaborating on personal events.</p>
        <p>A future work is to study the efectiveness of LLMs
in collecting personal narratives compared to standard
techniques such as textual stimuli or interviews in a
random controlled trial setting. Another is to study how to
instruct the model to steer the conversation toward
specific events relevant to the researchers or professionals
collecting the narratives.</p>
      </sec>
      <sec id="sec-5-2">
        <title>9A demo of this system can be found at https://www.youtube.com/</title>
        <p>watch?v=ozpuoEKsTjs
Acknowledgments</p>
      </sec>
      <sec id="sec-5-3">
        <title>4, we can observe that the entrainment scores computed</title>
        <p>between FTC and CT are lower than those computed
between FTS and ST. Thus, the fine-tuned models are more We acknowledge the support of the MUR PNRR project
aligned with the language of the synthetic dataset than FAIR - Future AI Research (PE00000013) and the MUR
the natural language found in the crowdsourced dataset, PNRR project iNEST- Interconnected Nord-Est
Innovasuggesting that learning from the synthetic data is easier. tion Ecosystem (ECS00000043) funded by the European
Union under NextGenerationEU. Views and opinions
expressed are however those of the author(s) only and do
7. Personal Narratives in VR not necessarily reflect those of the European Union or
The European Research Executive Agency. Neither the
European Union nor the granting authority can be held
responsible for them.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>A. Appendix</title>
      <sec id="sec-6-1">
        <title>A.1. Hyperparameters</title>
        <sec id="sec-6-1-1">
          <title>We used a batch size of 8 for the fine-tuning. The models</title>
          <p>
            were fine-tuned for 10 epochs with early stopping based
on the perplexity computed on the development set. We
have trained the autoregressive models, Vicuna 13B,
LLaMAntino 13B, Llama3 8B, in an auto-regressive manner
with Adam [
            <xref ref-type="bibr" rid="ref37">54</xref>
            ] optimizer. The models were fine-tuned
using Low-Rank Adaptation (LoRA) [43], i.e. a method
for fine-tuning large-scale LLMs, which reduces the
number of trainable parameters. We set the learning rate to
Declaration on Generative AI
During the preparation of this work, the author(s) used ChatGPT (OpenAI) and Grammarly in order
to: Paraphrase and reword and Grammar and spelling check. After using these tool(s)/service(s), the
author(s) reviewed and edited the content as needed and take(s) full responsibility for the
publication’s content.
          </p>
        </sec>
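        <p>A minimal sketch of this LoRA setup with the Hugging Face peft library follows; the library choice and the adapter values are illustrative, since the text does not report them.</p>
        <preformat># LoRA fine-tuning setup sketched with Hugging Face peft (our choice of
# library); the adapter hyperparameters below are placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_name = "meta-llama/Meta-Llama-3-8B"  # one of the fine-tuned models
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

lora_cfg = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05,
                      task_type="CAUSAL_LM")  # illustrative values
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # only the low-rank adapters are trained</preformat>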
      </sec>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the author(s) used ChatGPT (OpenAI) and Grammarly in order
to: paraphrase and reword, and grammar and spelling check. After using these tool(s)/service(s), the
author(s) reviewed and edited the content as needed and take(s) full responsibility for the
publication's content.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>T. B.</given-names>
            <surname>Connery</surname>
          </string-name>
          ,
          <article-title>A sourcebook of american literary journalism: representative writers in an emerging genre (</article-title>
          <year>1992</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>L.</given-names>
            <surname>Hobbs</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Davis</surname>
          </string-name>
          , Narrative pedagogies in science,
          <source>mathematics and technology, Res. Sci. Educ</source>
          .
          <volume>43</volume>
          (
          <year>2013</year>
          )
          <fpage>1289</fpage>
          -
          <lpage>1305</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>R. J.</given-names>
            <surname>Shiller</surname>
          </string-name>
          ,
          <article-title>Narrative economics: How stories go viral and drive major economic events</article-title>
          , Princeton University Press,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>K. D'Cruz</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Douglas</surname>
          </string-name>
          , T. Serry,
          <article-title>Personal narrative approaches in rehabilitation following traumatic brain injury: A synthesis of qualitative research</article-title>
          ,
          <source>Neuropsychological Rehabilitation</source>
          <volume>29</volume>
          (
          <year>2019</year>
          )
          <fpage>985</fpage>
          -
          <lpage>1004</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>C. N.</given-names>
            <surname>Wiesepape</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. T.</given-names>
            <surname>Lysaker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. E.</given-names>
            <surname>Queller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. H.</given-names>
            <surname>Lysaker</surname>
          </string-name>
          ,
          <article-title>Personal narratives and the pursuit of purpose and possibility in psychosis: directions for developing recovery-oriented treatments</article-title>
          ,
          <source>Expert Review of Neurotherapeutics</source>
          <volume>23</volume>
          (
          <year>2023</year>
          )
          <fpage>525</fpage>
          -
          <lpage>534</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>N.</given-names>
            <surname>Botting</surname>
          </string-name>
          ,
          <article-title>Narrative as a tool for the assessment of linguistic and pragmatic impairments, Child language teaching</article-title>
          and therapy
          <volume>18</volume>
          (
          <year>2002</year>
          )
          <fpage>1</fpage>
          -
          <lpage>21</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>M.</given-names>
            <surname>Danieli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Ciulli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. M.</given-names>
            <surname>Mousavi</surname>
          </string-name>
          , G. Silvestri,
          <string-name>
            <given-names>S.</given-names>
            <surname>Barbato</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. Di</given-names>
            <surname>Natale</surname>
          </string-name>
          ,
          <string-name>
            <surname>G.</surname>
          </string-name>
          <article-title>Riccardi, Assessing the impact of conversational artificial intelligence in the treatment of stress and anxiety in aging adults: randomized controlled trial</article-title>
          ,
          <source>JMIR mental health 9</source>
          (
          <year>2022</year>
          )
          <article-title>e38067</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>G.</given-names>
            <surname>Roccabruna</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. M.</given-names>
            <surname>Mousavi</surname>
          </string-name>
          , G. Riccardi,
          <article-title>Understanding emotion valence is a joint deep learning task</article-title>
          ,
          <source>in: Proceedings of the 13th Workshop on Computational Approaches</source>
          to Subjectivity, Sentiment, &amp;
          <string-name>
            <surname>Social Media Analysis</surname>
          </string-name>
          ,
          <year>2023</year>
          , pp.
          <fpage>85</fpage>
          -
          <lpage>95</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>A.</given-names>
            <surname>Tammewar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Cervone</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.-M.</given-names>
            <surname>Messner</surname>
          </string-name>
          , G. Riccardi,
          <article-title>Annotation of emotion carriers in personal narratives</article-title>
          ,
          <source>in: Proceedings of the Twelfth Language Resources and Evaluation Conference</source>
          , European Language Resources Association, Marseille, France,
          <year>2020</year>
          , pp.
          <fpage>1517</fpage>
          -
          <lpage>1525</lpage>
          . URL: https://aclanthology.org/ URL: https://aclanthology.org/
          <year>2024</year>
          .
          <article-title>naacl-long</article-title>
          .
          <year>341</year>
          .
          <year>2020</year>
          .lrec-
          <volume>1</volume>
          .189. doi:
          <volume>10</volume>
          .18653/v1/
          <year>2024</year>
          .
          <article-title>naacl-long</article-title>
          .
          <volume>341</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>T. R.</given-names>
            <surname>Sarbin</surname>
          </string-name>
          ,
          <article-title>The narrative as a root metaphor for</article-title>
          [22]
          <string-name>
            <given-names>J.</given-names>
            <surname>Achiam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Adler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Agarwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Ahmad</surname>
          </string-name>
          , I. Akkaya, psychology, Narrative psychology: The storied
          <string-name>
            <given-names>F. L.</given-names>
            <surname>Aleman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Almeida</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Altenschmidt</surname>
          </string-name>
          ,
          <string-name>
            <surname>S.</surname>
          </string-name>
          <article-title>Altnature of human conduct (</article-title>
          <year>1986</year>
          )
          <fpage>1</fpage>
          -
          <lpage>27</lpage>
          . man,
          <string-name>
            <given-names>S.</given-names>
            <surname>Anadkat</surname>
          </string-name>
          , et al.,
          <source>Gpt-4 technical report,</source>
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>U.</given-names>
            <surname>Neisser</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Fivush</surname>
          </string-name>
          ,
          <source>The remembering self: Con- arXiv preprint arXiv:2303.08774</source>
          (
          <year>2023</year>
          ).
          <article-title>struction and accuracy in the self-narrative, 6</article-title>
          , Cam- [23]
          <string-name>
            <given-names>J.</given-names>
            <surname>Qiu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Xiong</surname>
          </string-name>
          , Generating highly relevant quesbridge University Press,
          <year>1994</year>
          . tions,
          <source>in: Proceedings of the 2019 Conference</source>
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>C.</given-names>
            <surname>Mills</surname>
          </string-name>
          ,
          <string-name>
            <surname>S.</surname>
          </string-name>
          <article-title>D'Mello, On the validity of the auto- on Empirical Methods in Natural Language Probiographical emotional memory task for emotion cessing and the 9th International Joint Conferinduction</article-title>
          ,
          <source>PloS one 9</source>
          (
          <year>2014</year>
          )
          <article-title>e95837. ence on Natural Language Processing (EMNLP-</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <surname>J. M. Williams</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          <string-name>
            <surname>Broadbent</surname>
          </string-name>
          ,
          <string-name>
            <surname>Autobiographical</surname>
            <given-names>IJCNLP</given-names>
          </string-name>
          ),
          <article-title>Association for Computational Linguismemory in suicide attempters</article-title>
          .,
          <source>Journal of abnormal tics, Hong Kong, China</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>5983</fpage>
          -
          <lpage>5987</lpage>
          . URL: psychology
          <volume>95</volume>
          (
          <year>1986</year>
          )
          <article-title>144</article-title>
          . https://aclanthology.org/D19-1614. doi:
          <volume>10</volume>
          .18653/
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>D. C.</given-names>
            <surname>Rubin</surname>
          </string-name>
          ,
          <article-title>Remembering our past: Studies in auto-</article-title>
          v1/
          <fpage>D19</fpage>
          -1614. biographical memory, Cambridge University Press, [24]
          <string-name>
            <given-names>A.</given-names>
            <surname>Radford</surname>
          </string-name>
          ,
          <string-name>
            <surname>J. Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Child</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Luan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Amodei</surname>
          </string-name>
          ,
          <year>1999</year>
          . I.
          <string-name>
            <surname>Sutskever</surname>
          </string-name>
          , et al.,
          <source>Language models are unsuper-</source>
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <surname>R. J. McNally</surname>
            ,
            <given-names>N. B.</given-names>
          </string-name>
          <string-name>
            <surname>Lasko</surname>
            ,
            <given-names>M. L.</given-names>
          </string-name>
          <string-name>
            <surname>Macklin</surname>
          </string-name>
          , R. K. Pit- vised multitask learners,
          <source>OpenAI blog 1</source>
          (
          <year>2019</year>
          )
          <article-title>9</article-title>
          . man, Autobiographical memory disturbance in [25]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Tu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Rosset</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Craswell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Ai</surname>
          </string-name>
          , combat
          <article-title>-related posttraumatic stress disorder, Be- Zero-shot clarifying question generation for conhaviour research</article-title>
          and therapy
          <volume>33</volume>
          (
          <year>1995</year>
          )
          <fpage>619</fpage>
          -
          <lpage>630</lpage>
          . versational search,
          <source>in: Proceedings of the ACM</source>
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>G.</given-names>
            <surname>Borrini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Dall'Ora</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. Della</given-names>
            <surname>Sala</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Marinelli</surname>
          </string-name>
          ,
          <source>Web Conference</source>
          <year>2023</year>
          ,
          <year>2023</year>
          , pp.
          <fpage>3288</fpage>
          -
          <lpage>3298</lpage>
          . H. Spinnler,
          <article-title>Autobiographical memory</article-title>
          . sensitiv- [26]
          <string-name>
            <given-names>M.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <surname>M.</surname>
          </string-name>
          <article-title>Ghazvininejad, ity to age and education of a standardized enquiry, A</article-title>
          .
          <string-name>
            <surname>Mohamed</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          <string-name>
            <surname>Levy</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          <string-name>
            <surname>Stoyanov</surname>
          </string-name>
          , L.
          <source>ZettlePsychological Medicine</source>
          <volume>19</volume>
          (
          <year>1989</year>
          )
          <fpage>215</fpage>
          -
          <lpage>224</lpage>
          . moyer, BART:
          <article-title>Denoising sequence-to-sequence</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <surname>M. D. Kopelman</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <string-name>
            <surname>Wilson</surname>
            ,
            <given-names>A. D.</given-names>
          </string-name>
          <string-name>
            <surname>Baddeley</surname>
          </string-name>
          ,
          <article-title>The pre-training for natural language generation, transautobiographical memory interview: a new assess- lation, and comprehension, in: Proceedings of the ment of autobiographical and personal semantic 58th Annual Meeting of the Association for Commemory in amnesic patients</article-title>
          ,
          <source>Journal of clinical putational Linguistics</source>
          ,
          <source>Association for Computaand experimental neuropsychology 11</source>
          (
          <year>1989</year>
          )
          <fpage>724</fpage>
          - tional Linguistics, Online,
          <year>2020</year>
          , pp.
          <fpage>7871</fpage>
          -
          <lpage>7880</lpage>
          . URL:
          <volume>744</volume>
          . https://aclanthology.org/
          <year>2020</year>
          .acl-main.
          <volume>703</volume>
          . doi:10.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>B.</given-names>
            <surname>Levine</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Svoboda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. F.</given-names>
            <surname>Hay</surname>
          </string-name>
          , G. Winocur,
          <volume>18653</volume>
          /v1/
          <year>2020</year>
          .acl-main.703.
          <string-name>
            <surname>M. Moscovitch</surname>
            , Aging and autobiographical mem- [27]
            <given-names>Z.</given-names>
          </string-name>
          <string-name>
            <surname>Zhao</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          <string-name>
            <surname>Hou</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Yu</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Liu</surname>
          </string-name>
          , X. Ma, ory
          <article-title>: dissociating episodic from semantic retrieval</article-title>
          .,
          <source>Educational question generation of children stoPsychology and aging 17</source>
          (
          <year>2002</year>
          )
          <article-title>677. rybooks via question type distribution learning</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19] H. Inan, K. Upasani, J. Chi, R. Rungta, K. Iyer, Y. Mao, M. Tontchev, Q. Hu, B. Fuller, D. Testuggine, et al., Llama guard: Llm-based input-output safeguard for human-ai conversations, arXiv preprint arXiv:2312.06674 (2023).
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20] T. Rebedea, R. Dinu, M. N. Sreedhar, C. Parisien, J. Cohen, Nemo guardrails: A toolkit for controllable and safe llm applications with programmable rails, in: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, 2023, pp. 431-445.
        </mixed-citation>
      </ref>
      <ref id="ref-28">
        <mixed-citation>
          [28] S. Gupta, A. Agarwal, M. Gaur, K. Roy, V. Narayanan, P. Kumaraguru, A. Sheth, Learning to automate follow-up question generation using process knowledge for depression triage on reddit posts, in: Proceedings of the Eighth Workshop on Computational Linguistics and Clinical Psychology, 2022, p. 137.
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21] J. Ou, J. Lu, C. Liu, Y. Tang, F. Zhang, D. Zhang, K. Gai, DialogBench: Evaluating LLMs as human-like dialogue systems, in: K. Duh, H. Gomez, S. Bethard (Eds.), Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), Association for Computational Linguistics, Mexico City, Mexico, 2024, pp. 6137-6170.
        </mixed-citation>
      </ref>
      <ref id="ref-29">
        <mixed-citation>
          [29] C. Whitehouse, M. Choudhury, A. F. Aji, LLM-powered data augmentation for enhanced cross-lingual performance, in: H. Bouamor, J. Pino, K. Bali (Eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Singapore, 2023, pp. 671-686. URL: https://aclanthology.org/2023.emnlp-main.44. doi:10.18653/v1/2023.emnlp-main.44.
        </mixed-citation>
      </ref>
      <ref id="ref-37">
        <mixed-citation>
          [37] G. Roccabruna, A. Cervone, G. Riccardi, Multifunctional iso standard dialogue act tagging in italian, in: CLiC-it, 2020.
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [30] Z. Li, W. Chen, S. Li, H. Wang, J. Qian, X. Yan, Controllable dialogue simulation with in-context learning, in: Y. Goldberg, Z. Kozareva, Y. Zhang (Eds.), Findings of the Association for Computational Linguistics: EMNLP 2022, Association for Computational Linguistics, Abu Dhabi, United Arab Emirates, 2022, pp. 4330-4347. URL: https://aclanthology.org/2022.findings-emnlp.318. doi:10.18653/v1/2022.findings-emnlp.318.
        </mixed-citation>
      </ref>
      <ref id="ref-38">
        <mixed-citation>
          [38] J. F. Kelley, An iterative design methodology for user-friendly natural language office information applications, ACM Trans. Inf. Syst. 2 (1984) 26-41. URL: https://api.semanticscholar.org/CorpusID:207660078.
        </mixed-citation>
      </ref>
      <ref id="ref-39">
        <mixed-citation>
          [39] A. Grattafiori, A. Dubey, A. Jauhri, et al., The llama 3 herd of models, 2024. URL: https://arxiv.org/abs/2407.21783. arXiv:2407.21783.
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [31] T. Brown, B. Mann, N. Ryder, et al., Language models are few-shot learners, in: H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, H. Lin (Eds.), Advances in Neural Information Processing Systems, volume 33, Curran Associates, Inc., 2020, pp. 1877-1901. URL: https://proceedings.neurips.cc/paper_files/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref-40">
        <mixed-citation>
          [40] W.-L. Chiang, Z. Li, Z. Lin, Y. Sheng, Z. Wu, H. Zhang, L. Zheng, S. Zhuang, Y. Zhuang, J. E. Gonzalez, I. Stoica, E. P. Xing, Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality, 2023. URL: https://lmsys.org/blog/2023-03-30-vicuna/.
        </mixed-citation>
      </ref>
      <ref id="ref-41">
        <mixed-citation>
          [41] P. Basile, E. Musacchio, M. Polignano, L. Siciliani, G. Fiameni, G. Semeraro, Llamantino: Llama 2 models for effective text generation in italian language, 2023. URL: https://arxiv.org/abs/2312.09993. arXiv:2312.09993.
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [32] N. Ashok Kumar, A. Lan, Improving socratic question generation using data augmentation and preference optimization, in: E. Kochmar, M. Bexte, J. Burstein, A. Horbach, R. Laarmann-Quante, A. Tack, V. Yaneva, Z. Yuan (Eds.), Proceedings of the 19th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2024), Association for Computational Linguistics, Mexico City, Mexico, 2024, pp. 108-118. URL: https://aclanthology.org/2024.bea-1.10.
        </mixed-citation>
      </ref>
      <ref id="ref-42">
        <mixed-citation>
          [42] G. Sarti, M. Nissim, IT5: Text-to-text pretraining for Italian language understanding and generation, in: N. Calzolari, M.-Y. Kan, V. Hoste, A. Lenci, S. Sakti, N. Xue (Eds.), Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), ELRA and ICCL, Torino, Italia, 2024, pp. 9422-9433. URL: https://aclanthology.org/2024.lrec-main.823.
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [33] H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, et al., Llama 2: Open foundation and fine-tuned chat models, arXiv preprint arXiv:2307.09288 (2023).
        </mixed-citation>
      </ref>
      <ref id="ref-43">
        <mixed-citation>
          [43] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, W. Chen, Lora: Low-rank adaptation of large language models, 2021. URL: https://arxiv.org/abs/2106.09685. arXiv:2106.09685.
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [34] S. M. Mousavi, A. Cervone, M. Danieli, G. Riccardi, Would you like to tell me more? generating a corpus of psychotherapy dialogues, in: Proceedings of the Second Workshop on Natural Language Processing for Medical Conversations, Association for Computational Linguistics, Online, 2021, pp. 1-9. URL: https://aclanthology.org/2021.nlpmc-1.1. doi:10.18653/v1/2021.nlpmc-1.1.
        </mixed-citation>
      </ref>
      <ref id="ref-44">
        <mixed-citation>
          [44] K. Papineni, S. Roukos, T. Ward, W.-J. Zhu, Bleu: a method for automatic evaluation of machine translation, in: Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, ACL '02, Association for Computational Linguistics, USA, 2002, pp. 311-318. URL: https://doi.org/10.3115/1073083.1073135. doi:10.3115/1073083.1073135.
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [35] S. M. Mousavi, G. Roccabruna, A. Tammewar, S. Azzolin, G. Riccardi, Can emotion carriers explain automatic sentiment prediction? a study on personal narratives, in: Proceedings of the 12th Workshop on Computational Approaches to Subjectivity, Sentiment &amp; Social Media Analysis, Association for Computational Linguistics, Dublin, Ireland, 2022, pp. 62-70. URL: https://aclanthology.org/2022.wassa-1.6. doi:10.18653/v1/2022.wassa-1.6.
        </mixed-citation>
      </ref>
      <ref id="ref-45">
        <mixed-citation>
          [45] S. M. Mousavi, G. Roccabruna, M. Lorandi, S. Caldarella, G. Riccardi, Evaluation of response generation models: Shouldn't it be shareable and replicable?, in: A. Bosselut, K. Chandu, K. Dhole, V. Gangal, S. Gehrmann, Y. Jernite, J. Novikova, L. Perez-Beltrachini (Eds.), Proceedings of the 2nd Workshop on Natural Language Generation, Evaluation, and Metrics (GEM), Association for Computational Linguistics, Abu Dhabi, United Arab Emirates (Hybrid), 2022, pp. 136-147. URL: https://aclanthology.org/2022.gem-1.12. doi:10.18653/v1/2022.gem-1.12.
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [36] H. Bunt, V. Petukhova, D. Traum, J. Alexandersson, Dialogue act annotation with the iso 24617-2 standard, in: Multimodal interaction with W3C standards, Springer, 2017, pp. 109-135.
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [46] J. L. Fleiss, Measuring nominal scale agreement among many raters., Psychological bulletin 76 (1971) 378.
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [47] A. Belz, S. Mille, D. M. Howcroft, Disentangling the properties of human evaluation methods: A classification system to support comparability, meta-evaluation and reproducibility testing, in: Proceedings of the 13th International Conference on Natural Language Generation, Association for Computational Linguistics, Dublin, Ireland, 2020, pp. 183-194. URL: https://aclanthology.org/2020.inlg-1.24.
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [48] A. B. Sai, A. K. Mohankumar, M. M. Khapra, A survey of evaluation metrics used for nlg systems, ACM Computing Surveys (CSUR) 55 (2022) 1-39.
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          [49] P. Jaccard, Nouvelles recherches sur la distribution florale, Bull. Soc. Vaud. Sci. Nat. 44 (1908) 223-270.
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          [50] S. E. Brennan, et al., Lexical entrainment in spontaneous dialog, Proceedings of ISSD 96 (1996) 41-44.
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          [51] J. B. Hirschberg, A. Nenkova, A. Gravano, High frequency word entrainment in spoken dialogue (2008).
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          [52] J. Grosman, Fine-tuned XLSR-53 large model for speech recognition in Italian, https://huggingface.co/jonatasgrosman/wav2vec2-large-xlsr-53-italian, 2021.
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          [53] M. Yin, G. Roccabruna, A. Azad, G. Riccardi, Let's give a voice to conversational agents in virtual reality, in: Proceedings of Interspeech 2023, Dublin, Ireland, 2023, pp. 5247-5248.
        </mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>
          [54]
          <string-name>
            <given-names>D. P.</given-names>
            <surname>Kingma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ba</surname>
          </string-name>
          ,
          <article-title>Adam: A method for stochastic optimization</article-title>
          ,
          <year>2017</year>
          . URL: https://arxiv.org/abs/1412. 6980. arXiv:
          <volume>1412</volume>
          .
          <fpage>6980</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref38">
        <mixed-citation>
          [55] N. Shazeer, M. Stern, Adafactor: Adaptive learning rates with sublinear memory cost, in: J. Dy, A. Krause (Eds.), Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, PMLR, 2018, pp. 4596-4604. URL: https://proceedings.mlr.press/v80/shazeer18a.html.
        </mixed-citation>
      </ref>
    </ref-list>
    <sec id="sec-appendix">
      <title>Appendix: Fine-tuning and inference details</title>
      <p>We set the learning rate to 1e-5 and the rank and alpha parameters to 128. We have used the top-k sampling strategy to generate the new tokens with k set to 10. The IT5 model was fully fine-tuned with the Adafactor [55] optimizer. We have used beam search with four beams as the decoding strategy. To run our experiments, we used a machine with two Nvidia 3090 GPUs with 24GB and an Nvidia A100 with 80GB. Overall, the training time for each experiment was less than 30 minutes, and the inference time was less than 15 minutes.</p>
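      <p>For concreteness, the following is a minimal sketch, not the authors' released code, of how this configuration could be reproduced with the Hugging Face transformers and peft libraries. The checkpoint name and the prompt are illustrative placeholders; only the rank/alpha of 128, the learning rate of 1e-5, top-k sampling with k=10, and four-beam search for IT5 come from the text above.</p>
      <preformat>
# Minimal sketch of the fine-tuning and decoding setup described above.
# Assumed stack: transformers + peft; checkpoint and prompt are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "meta-llama/Llama-2-7b-chat-hf"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base, torch_dtype=torch.float16)

# LoRA adapters with rank and alpha set to 128, as in the appendix text;
# training would then run with a learning rate of 1e-5 (e.g. via Trainer).
model = get_peft_model(model, LoraConfig(r=128, lora_alpha=128, task_type="CAUSAL_LM"))

# Decoding for the LoRA-tuned causal LMs: top-k sampling with k = 10.
prompt = "Raccontami di piu su questo evento."  # illustrative elicitation prompt
inputs = tokenizer(prompt, return_tensors="pt")
out = model.generate(**inputs, do_sample=True, top_k=10, max_new_tokens=64)
print(tokenizer.decode(out[0], skip_special_tokens=True))

# For IT5 (a seq2seq model fully fine-tuned with Adafactor [55]), decoding
# would instead use beam search with four beams:
#   out = it5_model.generate(**inputs, num_beams=4, do_sample=False)
      </preformat>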
    </sec>
  </back>
</article>