
Assessing Italian Large Language Models on Energy Feedback Generation: A Human Evaluation Study

Manuela Sanguinetti

Alessandro Pani

Alessandra Perniciano

Luca Zedda

Andrea Loddo

Maurizio Atzori

Department of Mathematics and Computer Science, University of Cagliari, Italy

This work presents a comparison of some recently released instruction-tuned large language models for the Italian language, focusing in particular on their effectiveness in a specific application scenario, i.e., that of delivering energy feedback. This work is part of a larger project aimed at developing a conversational interface for users of a renewable energy community, where clarity and accuracy of the provided feedback are important for proper energy management. The comparison is based on the human evaluation of the output produced by such models using energy data as input. Specifically, the data pertains to the power flows within a household equipped with a photovoltaic (PV) plant and a battery storage system. The goal of the feedback is to provide the user with such information in a meaningful way, based on the specific aspect they intend to monitor at a given moment (e.g., self-consumption levels, the power generated by the PV panels or imported from the main grid, or the battery state of charge). This evaluation experiment has a two-fold purpose: to provide an exploratory analysis of the models' abilities on this specific generation task, relying solely on the information and instruction provided in the prompt, and to serve as an initial investigation into their potential as reliable tools for generating user-friendly energy feedback in the intended scenario.

Keywords: energy feedback, large language models, Italian

1. Introduction and Motivations

The provision of energy feedback plays a crucial role in promoting energy efficiency among users. The expression energy feedback (or eco-feedback) covers a wide range of energy-related information. This can include detailed reports on energy usage and production (in the case of renewable energy sources), as well as energy-saving advice, whether generic or user-specific. The primary goal of energy feedback is to allow users to make informed decisions regarding their energy management, thus promoting better conservation practices.

A substantial body of literature within the field of Human-Computer Interaction (HCI) has explored various energy feedback mechanisms, primarily focusing on visual or ambient feedback as well as gamification techniques (we refer to the surveys proposed by Albertarelli et al. [1] and Chalal et al. [2] for further details on these aspects). However, greater interest has been reported in the delivery of energy feedback through conversational agents [3]. Furthermore, within the field of Natural Language Generation (NLG), several studies prior to the advent of Large Language Models (LLMs) investigated the use of NLG architectures to communicate consumption data. Notable works include those by Trivino and Sanchez-Valdes [4] and Conde-Clemente et al. [5], which used fuzzy sets to tackle data-to-text generation tasks, also tailoring the linguistic description to given consumption profiles. Similarly, Martínez-Municio et al. [6] employed fuzzy sets to produce linguistic summaries based on the consumption of specific buildings or groups of buildings, using time series data as input.

This work is part of a research project aimed at developing a modular task-oriented conversational agent to inform users about their energy consumption and photovoltaic (PV) production and, more generally, to support better management of their energy resources through text-based energy feedback. The conversational agent will then be deployed and tested within a renewable energy community in Italy, which motivates our specific focus on Italian as the primary language for the interactions. At this stage of the project, we plan to integrate [...] feedback based on actual energy data.

The main objective of this study is thus to verify how effectively instruction-tuned LLMs currently available for the Italian language can deliver clear and accurate feedback based on energy data provided within a prompt, without relying on more elaborate techniques like fine-tuning or Retrieval Augmented Generation. More specifically, we formulated the following research questions:

• Are the LLMs under study able to produce energy feedback that is 1) informative, 2) comprehensible, and 3) accurate with respect to the provided energy data?
• Are there any major differences among such models with respect to these capabilities?

To answer these questions, we conducted an exploratory analysis by manually evaluating some of these Italian LLMs, organizing the study around criteria designed to quantify these specific aspects.

This work closely aligns with a recent initiative that has been launched within the Italian NLP community, i.e., CALAMITA², a campaign aimed at evaluating the capabilities of Italian (or multilingual, but including Italian) LLMs on specific tasks in zero or few-shot settings. Unlike the latter, however, our study relies solely on human judgments rather than automatic metrics. The main challenges of a manual approach include the absence of standardized practices and evaluation criteria [7], as well as the lack of systematic documentation [8], which hinders the reproducibility of such studies.³ In light of these challenges, the intended contributions of this paper are outlined below:

• A small-scale human evaluation of several Italian LLMs on a specific task.
• The description of a protocol for human evaluation inspired by the good practices recommended in recent literature [9, 10]. To this end, we also make available the evaluation dataset, with the ratings assigned by the evaluators in a non-aggregated form.⁴

The remainder of this paper describes how this study was designed and carried out, with a discussion of the results obtained and the main limitations of the work.

² https://clic2024.ilc.cnr.it/calamita/
³ An attempt in this respect is made within the ReproHum project: https://reprohum.github.io/
⁴ https://github.com/msang/nl-interface/tree/main/humEval

2. Study Design

As anticipated in the previous section, the main goal of this human evaluation experiment is to assess the overall quality (using specific criteria that will be defined later) of the energy feedback generated by Italian LLMs. The task assigned to the tested models is broadly intended as a summarization task, in that the expected output is supposed to provide a summary of the relevant information available in the prompt. What follows is an overview of the main principles that guided the selection of the models, the development of the dataset used for evaluation, and the whole evaluation protocol.

2.1. Models and Setting

The selection of the models was primarily driven by the intended application scenario of the overarching project (also mentioned in the previous section), which narrowed down our choice to Italian models. In addition, we opted for open-source models that can be run locally, avoiding the use of APIs. For greater simplicity and practicality, we looked for the Italian models available on HuggingFace, the reference platform for the release of such resources.

As a final choice, we exclusively selected instruction-tuned models. These models are trained to follow a wide range of instructions provided in the prompt, offering greater flexibility in handling diverse tasks compared to more specialized fine-tuned models.⁵ This ability makes them particularly suitable for our purposes. In light of this, we selected for our study the following models⁶: Cerbero-7B⁷ [11], LLaMAntino2-7B [12], and more specifically the version trained on the UltraChat-ITA dataset⁸, LLaMAntino3-8B-ANITA⁹ [13], and Zefiro-7B¹⁰.

Regarding the text generation settings, we chose high temperature values to allow the generation of more diverse responses. Specifically, we set both temperature and _ to 0.9 in order to obtain less deterministic and more varied outputs. On the other hand, to ensure a balance between variety and coherence, we kept the _ value low (0.2). After some preliminary tests, we found that these settings provided satisfactory results and could be reasonably used for the actual evaluation phase. As regards the output length, we limited its maximum to 250 tokens to prevent excessively lengthy responses and disabled the option that returns the input prompt as part of the output.

⁵ It is important to note, however, that depending on the task at hand, a prompt (even if supplemented with additional examples) may not be sufficient to obtain good results, and further model refinements might be necessary.
⁶ For simplicity, throughout the paper, only the models' names will be used, without including parameter specifications or additional suffixes.
⁷ https://huggingface.co/galatolo/cerbero-7b
⁸ https://huggingface.co/swap-uniba/LLaMAntino-2-chat-7b-hf-UltraChat-ITA
⁹ https://huggingface.co/swap-uniba/LLaMAntino-3-ANITA-8B-Inst-DPO-ITA
¹⁰ https://huggingface.co/giux78/zefiro-7b-beta-ITA-v0.1

2.2. Data and Prompts

The dataset used for evaluation comprises responses from each of the four models tested. These responses were based on an input prompt consisting of two fixed components (the premise and the instruction) and two dynamic elements: user request and information on energy data (see also Figure 1).

Figure 1: Pipeline for creating the evaluation dataset used in models' comparison.

Regarding the latter, the data available for the experiments can vary and is related to the specific use case of a household equipped with a PV system and a battery storage solution. In this scenario, the PV system can distribute the energy produced to meet user consumption needs, charge the battery, or feed into the main grid. The battery, in turn, can supply power to the user, especially when there is no solar production. The data presented in the prompt describes the energy flow among these different sources and is listed in the form of verbal descriptions, each accompanied by the corresponding data value and unit of measure (or current status if referred to the battery). This data is summarized in Table 1. In order to provide a more realistic depiction of the usage scenario and to introduce a greater variety in the prompt to be processed by the models, the included data encompasses various combinations of values across different aspects (e.g., including greater or lesser household consumption or solar production or different battery charge levels).

The user requests were randomly sampled from an in-house dataset for intent detection previously developed to train the NLU module of the conversational agent of the main project.¹¹ The types of user requests used in the evaluation focused on typical monitoring functions. These requests primarily aim to check energy consumption or production data from the PV panels. They may be focused on information such as household energy usage, battery charge status, or current power generation (e.g., quanto stanno producendo i pannelli?, EN: "how much are the panels producing?"). Furthermore, requests may require brief and concise responses about a single piece of information (quanto è carica la batteria?, EN: "how charged is the battery?"), or more comprehensive overviews (mi serve un quadro completo dei consumi, EN: "I need a full overview of the consumption").

The instruction provided in the prompt, aiming to reflect the main intended task, was formulated as follows: "Riassumi le informazioni che ti ho appena fornito per rispondere alla seguente domanda: [USER_REQUEST]" (EN: "Summarize the information I have just provided to answer the following question").

The final dataset for the evaluation phase comprises 50 responses from each model, hence 200 responses overall. The following section provides a detailed description of the evaluation process.

¹¹ The backbone architecture of the agent has been developed using RASA [14], and the corpus was originally created to train its built-in classifier, DIET [15].

2.3. Evaluation Protocol

The actual evaluation phase was preceded by a briefing session and a pilot annotation phase. During the briefing, evaluators discussed the task at hand in order to make sure they fully understood the evaluation criteria and the meaning of the scale values. Following the briefing, a pilot evaluation was carried out. This step allowed evaluators to familiarize themselves with the process and refine their understanding of the evaluation criteria. Once these preparatory steps were completed, they proceeded with the main evaluation task. They worked independently and were not aware of the specific models they were evaluating, to mitigate possible biases deriving from any preconceived notions of the models.

Four human evaluators, who are co-authors of this paper, conducted the evaluation task. The group comprises three males and one female, each with a background in Computer Science and ranging from graduate students to assistant professors. While all evaluators are familiar with technologies such as conversational agents and possess a good overall understanding of LLMs, their knowledge of concepts related to electricity (e.g., the distinction between power and energy) and renewable energy technologies (such as PV systems and storage solutions) varies from minimal to substantial.

Evaluators were instructed to assign a Likert-type rating on a 1-7 scale to each model response for each evaluation criterion. The rating scale is anchored with symmetrical verbal labels as follows: 1: Strongly Disagree; 2: Disagree; 3: Mildly Disagree; 4: Neither Agree nor Disagree; 5: Mildly Agree; 6: Agree; 7: Strongly Agree.

The evaluation criteria were designed to address the three dimensions outlined in our first research question: informativeness, comprehensibility, and accuracy. These dimensions represent the factors we deemed essential in the delivery of effective energy feedback; ultimately they are meant to guide the choice of the most suitable model for our intended application scenario. To evaluate informativeness, we drew inspiration from previous work by Mazzei et al. [16], considering two complementary aspects: Usefulness, i.e., the extent to which the information provided by the system is useful in responding to the user's request, and Necessity, i.e., the completeness of the information provided, ensuring all necessary details are included. Similarly, to assess the comprehensibility of the models' responses, we considered two criteria: Understandability, i.e., the extent to which the information is presented in an easy-to-understand manner, and Fluency, i.e., the degree to which a text 'flows well'. The third dimension, Accuracy, was evaluated based on the degree to which the content of an output is correct, accurate, and true relative to the input. The definitions of Understandability, Fluency, and Accuracy were drawn from the overview proposed in Howcroft et al. [7]. For each of these five criteria, evaluators were asked to assign a rating within the proposed scale, guided by a specific question associated with each criterion (see Table 2).

To both facilitate the evaluators' work and ensure an accurate rating for each evaluation criterion, each model response was presented alongside the user's request in isolation as well as the entire prompt. This provided them with the full context needed to carry out the task and allowed them to understand the information the model had access to during the response generation. Some examples of prompts, along with the model's output and the evaluation provided by the judges, are reported in Sections A.1-A.2.

3. Results

Once all judges completed the task, we first measured the Inter-Annotator Agreement using Krippendorff's α.¹² We computed the metric separately for each model and each evaluation criterion. Results are summarized in Table 3, which also shows the average results both per model and per criterion.

¹² We used the statistical package K-Alpha Calculator [17]: https://www.k-alpha.org/

The results reveal varying levels of consistency among the evaluators, ranging from moderate to low agreement across all criteria. In particular, Understandability and Fluency exhibit a higher degree of disagreement among the evaluators. This could be due to the subjective nature of these criteria, as different evaluators might give different interpretations of what is considered comprehensible and linguistically fluent. Overall, this variation highlights the probable need for more training for evaluators to improve their consistency, especially in assessing subjective criteria.

As for the models' comparison, we first aggregated all ratings assigned in order to provide an overview of the models' output across all five evaluation criteria. Since the data is ordinal, we use the median value as an aggregation function to assess the central tendency of the ratings (as also suggested in Amidei et al. [9]). The results, shown in Table 4, indicate medium to high ratings overall across all models. To answer our first research question, we examined the overall medians for each evaluation criterion. The values obtained show that the models perform reasonably well despite the variability across them. Concerning the dimension of informativeness, ratings range from 4 to 6 in Usefulness and from 5 to 7 in Necessity, suggesting that further refinements might be necessary to ensure that the energy feedback delivered is useful and complete. In terms of comprehensibility, the corresponding criteria show that all models are capable of generating responses that are easily understandable and fluent, which are both relevant factors that might contribute to a more enjoyable user experience in view of the possible integration of such models in a conversational interface. As regards Accuracy, the energy feedback generated by the models is also generally correct, with only one exception (LLaMAntino2). This indicates that, overall, the models provide accurate and reliable information, another important factor when users have to make informed decisions based on that feedback.

To answer our second research question, we then considered the overall differences among the models. As also shown in Table 4, LLaMAntino2 quite consistently received lower ratings, particularly for Usefulness and Accuracy, while the other models received high ratings overall, suggesting that they might be considered comparable. To inspect this further, we carried out some statistical tests. We first used the Kruskal-Wallis test, a non-parametric test suitable for ordinal data, to compare the distributions of more than two independent groups. We used it to determine whether the differences among the median values obtained for the models were statistically significant, and the comparisons were carried out separately for each evaluation criterion. This preliminary test confirmed that the differences observed are indeed significant, considering a standard threshold of p < 0.05. However, the Kruskal-Wallis test does not determine which models are significantly different from each other. Therefore, we proceeded with pairwise comparisons using Dunn's test. This test confirmed a significant difference between LLaMAntino2 and the other three models.

Table 5: P-values obtained with pairwise comparisons between LLaMAntino2 and the remaining models, using Dunn's test, and adjusted using Bonferroni correction.

Criterion            Cerbero    LLaMAntino3    Zefiro
Usefulness           5e-04      1e-08          7e-08
Necessity            3e-12      2e-03          4e-04
Understandability    3e-07      1e-03          9e-08
Fluency              2e-04      3e-02          5e-02
Accuracy             5e-16      1e-10          1e-09

Table 5 shows the p-values obtained by comparing this model with the other three for each evaluation criterion. The remaining comparisons yielded p-values well above the 0.05 threshold; therefore, the null hypothesis cannot be rejected for those cases. The other three models can thus be considered comparable based on the ratings assigned by the evaluators in our experiment.
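The median-based aggregation used for the models' comparison can be illustrated with a short sketch. It relies only on the Python standard library; the model names, the ratings, and the function names `median_by_key` and `median_by_criterion` are hypothetical and serve purely as an illustration, not as the data or code of the study.

```python
from statistics import median

# Hypothetical ratings on the 1-7 scale: (model, criterion) -> list of judgments.
# These numbers are made up for illustration and are not taken from the study.
ratings = {
    ("model_a", "Usefulness"): [6, 5, 7, 6],
    ("model_a", "Accuracy"): [7, 6, 6, 7],
    ("model_b", "Usefulness"): [3, 4, 2, 5],
    ("model_b", "Accuracy"): [4, 3, 5, 2],
}


def median_by_key(ratings):
    """Aggregate ordinal ratings with the median for each (model, criterion) pair."""
    return {key: median(values) for key, values in ratings.items()}


def median_by_criterion(ratings):
    """Pool the ratings of all models and take one median per criterion."""
    pooled = {}
    for (_model, criterion), values in ratings.items():
        pooled.setdefault(criterion, []).extend(values)
    return {criterion: median(values) for criterion, values in pooled.items()}


per_model = median_by_key(ratings)
overall = median_by_criterion(ratings)
print(per_model)
print(overall)
```

With an even number of ratings, `statistics.median` returns the mean of the two middle values, so half-point medians can appear even though the scale itself is integer-valued.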

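The two-step testing procedure (a Kruskal-Wallis test followed by pairwise Dunn comparisons with Bonferroni adjustment) can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the code used in the study: the source does not say which statistical software was used for these tests, the ratings are invented, all function names are hypothetical, and the Bonferroni factor here is the number of all model pairs.

```python
import math
from collections import Counter
from itertools import combinations


def pooled_ranks(groups):
    """Rank all observations together; tied values share the mean of their ranks."""
    pooled = [v for g in groups for v in g]
    order = sorted(range(len(pooled)), key=lambda i: pooled[i])
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and pooled[order[j + 1]] == pooled[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1  # mean of 1-based positions i+1..j+1
        i = j + 1
    per_group, start = [], 0
    for g in groups:  # split the ranks back into one list per group
        per_group.append(ranks[start:start + len(g)])
        start += len(g)
    ties = sum(t ** 3 - t for t in Counter(pooled).values())
    return per_group, len(pooled), ties


def chi2_sf(x, df):
    """Survival function of the chi-square distribution for integer df >= 1."""
    if x <= 0:
        return 1.0
    z = x / 2
    if df % 2 == 0:
        a, q = 1.0, math.exp(-z)              # Q(1, z)
    else:
        a, q = 0.5, math.erfc(math.sqrt(z))   # Q(1/2, z)
    while a < df / 2:  # Q(a + 1, z) = Q(a, z) + z**a * exp(-z) / Gamma(a + 1)
        q += z ** a * math.exp(-z) / math.gamma(a + 1)
        a += 1
    return q


def kruskal_wallis(groups):
    """Return the tie-corrected H statistic and its chi-square p-value."""
    ranks, n, ties = pooled_ranks(groups)
    h = 12 / (n * (n + 1)) * sum(sum(r) ** 2 / len(r) for r in ranks) - 3 * (n + 1)
    correction = 1 - ties / (n ** 3 - n)
    if correction == 0:  # every observation is identical
        return 0.0, 1.0
    h /= correction
    return h, chi2_sf(h, len(groups) - 1)


def dunn_bonferroni(groups):
    """Pairwise Dunn tests; p-values are multiplied by the number of pairs.

    Assumes the pooled observations are not all identical.
    """
    ranks, n, ties = pooled_ranks(groups)
    mean_rank = [sum(r) / len(r) for r in ranks]
    base_var = n * (n + 1) / 12 - ties / (12 * (n - 1))
    pairs = list(combinations(range(len(groups)), 2))
    adjusted = {}
    for i, j in pairs:
        se = math.sqrt(base_var * (1 / len(groups[i]) + 1 / len(groups[j])))
        z = (mean_rank[i] - mean_rank[j]) / se
        p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal p-value
        adjusted[(i, j)] = min(1.0, p * len(pairs))
    return adjusted


# Hypothetical ratings for one criterion, one list per model (not the study's data).
example = [[6, 7, 6, 5, 7], [6, 6, 7, 5, 6], [2, 3, 4, 2, 3]]
h_stat, p_value = kruskal_wallis(example)
print(h_stat, p_value)
print(dunn_bonferroni(example))
```

In this sketch the keys of the dictionary returned by `dunn_bonferroni` are index pairs into the list of groups, so `(0, 2)` refers to the comparison between the first and third model.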
4. Conclusions and Limitations

This study provides an initial assessment of several Italian language models' ability to generate effective energy feedback. The results indicate that while the models generally perform well, particularly in terms of comprehensibility and accuracy, there is greater variability regarding informativeness. Among the tested models, results show that, except for LLaMAntino2-7B-UltraChat, the remaining ones provide comparable performances. However, it is important to highlight the limitations of this study. First, this is a small-scale study, as it involves a limited number of models and evaluators. Concerning the former issue, we also point out that the study was restricted to models available on HuggingFace, excluding potentially relevant models from external sources, such as Fauno¹³ and Camoscio [18]. A more systematic study should consider these models as well, in order to provide a more comprehensive evaluation of the Italian LLM landscape. As for the pool of evaluators, it is important to note a significant bias in both their personal backgrounds and demographics. All the judges have a background in computer science and varying degrees of familiarity with the topics at hand. Furthermore, there is a gender imbalance (1 female and 3 male judges) and a lack of age diversity, as all four judges fall within the 24-30 age range. In light of these considerations, a more systematic comparison such as the one envisioned above would benefit from a broader and more diverse pool of evaluators. This would not only increase the reliability of the comparison but also provide a deeper understanding of potential correlations between socio-demographic factors, prior knowledge of technology and energy-related concepts, and the differing perceptions of the evaluation criteria considered in our study. Common approaches to address the lack of human participants include the use of crowdsourcing platforms, with a careful design of participation criteria that would ensure a better gender and demographic balance. Alternatively, a user study involving prospective users of the conversational agent could be conducted; this would ultimately make it possible to gather valuable insights on the type of feedback expected by the target audience.

Finally, an extended evaluation framework should also include an analysis of the statistical power of the sample size to ensure more robust conclusions.

Despite these limitations, this work offers a preliminary overview and aims to pave the way for future research on this aspect, also stressing the importance of more standardized human evaluation practices. As a matter of fact, the evaluation protocol we designed draws heavily from methodologies recommended in more general literature pertaining to human evaluation within generation and summarization tasks. Our approach thus aims to ensure that the core principles of the experiment are flexible enough to be easily replicated or adapted for a wider range of different domains.

¹³ https://github.com/RSTLess-research/Fauno-Italian-LLM

Acknowledgments

This work has been developed within the framework of the project e.INS - Ecosystem of Innovation for Next Generation Sardinia (cod. ECS 00000038) funded by the Italian Ministry for Research and Education (MUR) under the National Recovery and Resilience Plan (NRRP) - MISSION 4 COMPONENT 2, "From research to business" INVESTMENT 1.5, "Creation and strengthening of Ecosystems of innovation" and construction of "Territorial R&D Leaders". This work was also partially funded under the National Recovery and Resilience Plan (NRRP) - Mission 4 Component 2 Investment 1.3, Project code PE0000021, "Network 4 Energy Sustainable Transition - NEST".

References

[1] Albertarelli, Fraternali, Herrera, Melenhorst, Novak, Pasini, A.-E. Rizzoli, Rottondi, A Survey on the Design of Gamified Systems for Energy and Water Sustainability, Games 9 (2018). doi:10.3390/g9030038.

[2] Chalal, Medjdoub, Bezai, Bull, Zune, Visualisation in Energy Eco-Feedback Systems: A Systematic Review of Good Practice, Renewable and Sustainable Energy Reviews 162 (2022). doi:10.1016/j.rser.2022.112447.

[3] Sanguinetti, Atzori, Conversational Agents for Energy Awareness and Efficiency: A Survey, Electronics 13 (2024). doi:10.3390/electronics13020401.

[4] Trivino, D. Sanchez-Valdes, Generation of Linguistic Advices for Saving Energy: Architecture, in: A.-H. Dediu, L. Magdalena, C. Martín-Vide (Eds.), Theory and Practice of Natural Computing, Springer International Publishing, Cham, 2015, pp. 83-94.

[5] Conde-Clemente, J. M. Alonso, G. Trivino, Toward Automatic Generation of Linguistic Advice for Saving Energy at Home, Soft Computing 22 (2018) 345-359. doi:10.1007/s00500-016-2430-5.

[6] Martínez-Municio, Rodríguez-Benítez, Castillo-Herrera, Giralt-Muiña, L. Jiménez-Linares, Linguistic Modeling and Synthesis of Heterogeneous Energy Consumption Time Series Sets, International Journal of Computational Intelligence Systems 12 (2018) 259. doi:10.2991/ijcis.2018.125905639.

[7] D. M. Howcroft, Belz, M.-A. Clinciu, D. Gkatzia, S. A. Hasan, Mahamood, Mille, E. Van Miltenburg, S. Santhanam, V. Rieser, Twenty Years of Confusion in Human Evaluation: NLG Needs Evaluation Sheets and Standardised Definitions, in: Proceedings of the 13th International Conference on Natural Language Generation, Association for Computational Linguistics, Dublin, Ireland, 2020, pp. 169-182. doi:10.18653/v1/2020.inlg-1.23.

[8] Shimorina, Belz, The Human Evaluation Datasheet: A Template for Recording Details of Human Evaluation Experiments in NLP, in: A. Belz, Popović, Reiter, A. Shimorina (Eds.), Proceedings of the 2nd Workshop on Human Evaluation of NLP Systems (HumEval), Association for Computational Linguistics, Dublin, Ireland, 2022, pp. 54-75. doi:10.18653/v1/2022.humeval-1.6.

[9] Amidei, Piwek, Willis, The Use of Rating and Likert Scales in Natural Language Generation Human Evaluation Tasks: A Review and some Recommendations, in: Proceedings of the 12th International Conference on Natural Language Generation, Association for Computational Linguistics, Tokyo, Japan, 2019, pp. 397-402. doi:10.18653/v1/W19-8648.

[10] C. Van Der Lee, A. Gatt, E. Van Miltenburg, Krahmer, Human evaluation of automatically generated text: Current trends and best practice guidelines, Computer Speech & Language 67 (2021) 101151. doi:10.1016/j.csl.2020.101151.

[11] F. A. Galatolo, M. G. C. A. Cimino, Cerbero-7B: A Leap Forward in Language-Specific LLMs Through Enhanced Chat Corpus Generation and Evaluation, 2023. arXiv:2311.15698.

[12] Basile, E. Musacchio, Polignano, Siciliani, G. Fiameni, G. Semeraro, LLaMAntino: LLaMA 2 Models for Effective Text Generation in Italian Language, 2023. arXiv:2312.09993.

[13] Polignano, Basile, G. Semeraro, Advanced Natural-based interaction for the ITAlian language: LLaMAntino-3-ANITA, 2024. arXiv:2405.07101.

[14] Bocklisch, Faulkner, Pawlowski, Nichol, Rasa: Open Source Language Understanding and Dialogue Management, CoRR abs/1712.05181 (2017). arXiv:1712.05181.

[15] Bunk, Varshneya, Vlasov, Nichol, DIET: Lightweight Language Understanding for Dialogue Systems, CoRR abs/2004.09936 (2020). arXiv:2004.09936.

[16] Mazzei, Anselma, Sanguinetti, Rapp, Mana, M. M. Hossain, V. Patti, R. Simeoni, L. Longo, Anticipating User Intentions in Customer Care Dialogue Systems, IEEE Transactions on Human-Machine Systems (2022). doi:10.1109/THMS.2022.3184400.

[17] Marzi, Balzano, Marchiori, K-alpha calculator-Krippendorff's alpha calculator: A user-friendly tool for computing Krippendorff's alpha inter-rater reliability coefficient, MethodsX 12 (2024) 102545. doi:10.1016/j.mex.2023.102545.

[18] Santilli, E. Rodolà, Camoscio: An Italian Instruction-tuned LLaMA, in: F. Boschetti, G. E. Lebani, Magnini, N. Novielli (Eds.), Proceedings of the 9th Italian Conference on Computational Linguistics, Venice, Italy, November 30 - December 2, 2023, volume 3596 of CEUR Workshop Proceedings, CEUR-WS.org, 2023.

A.1. Example with response generated by LLaMAntino2

Prompt

<s>[INST] <<SYS>> Sei un an assistente AI per la lingua italiana. Rispondi nella lingua usata per la domanda in modo chiaro, diretto e completo. Attieniti strettamente alle istruzioni fornite e riporta la tua risposta nel formato richiesto, senza aggiungere ulteriori commenti o spiegazioni. <</SYS>>

Considera le seguenti informazioni relative all'andamento di un impianto fotovoltaico:
- potenza prodotta ora dall'impianto fotovoltaico: 0.0 kW
- potenza utilizzata ora dalla casa: 0.23 kW
- stato di carica attuale della batteria: 55%
- status della batteria: in scarica
- potenza immessa in rete: 0.15 kW
- potenza fornita dalla batteria: 0.38 kW
- energia utilizzata dalla casa in tutta la giornata: 14.45 kWh
- energia totale prodotta dai pannelli in tutta la giornata: 36.909 kWh
- energia acquistata dalla rete in tutta la giornata: 0.55 kWh
- auto-consumo della giornata: 13.90 kWh
- energia immessa in rete in tutta la giornata: 23.01 kWh

Riassumi le informazioni che ti ho appena fornito per rispondere alla seguente domanda: voglio avere un resoconto della produzione energetica dei pannelli solari [/INST]

Response

"Mi dispiace, non posso fornire una risposta perché la vostra domanda non è specifica. Per ottenere una risposta precisa, è necessario fornire ulteriori informazioni."

Evaluation

A.2. Example with response generated by [...]

Prompt

<|system|> Sei un an assistente AI per la lingua italiana. Rispondi nella lingua usata per la domanda in modo chiaro, diretto e completo. Attieniti strettamente alle istruzioni fornite e riporta la tua risposta nel formato richiesto, senza aggiungere ulteriori commenti o spiegazioni.</s> <|user|>

Considera le seguenti informazioni relative all'andamento di un impianto fotovoltaico:
- potenza prodotta ora dall'impianto fotovoltaico: 0.0 kW
- potenza utilizzata ora dalla casa: 0.36 kW
- stato di carica attuale della batteria: 53%
- status della batteria: in scarica
- potenza fornita dalla batteria: 0.32 kW
- potenza acquistata dalla rete: 0.04 kW
- energia utilizzata dalla casa in tutta la giornata: 14.45 kWh
- energia totale prodotta dai pannelli in tutta la giornata: 36.909 kWh
- energia acquistata dalla rete in tutta la giornata: 0.55 kWh
- auto-consumo della giornata: 13.90 kWh
- energia immessa in rete in tutta la giornata: 23.01 kWh

Riassumi le informazioni che ti ho appena fornito per rispondere alla seguente domanda: livello di carica della batteria al momento</s> <|assistant|>

Response

"La batteria ha un livello di carica del 53% al momento, che indica che ci sono circa 53% di carica nella batteria."

Evaluation
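The prompt layout described in Section 2.2 (a fixed premise, a list of energy readings, a fixed instruction, and the user request) can be sketched as below. The function `build_user_message` and the `GENERATION_SETTINGS` dictionary are hypothetical: the paper does not state how prompts were assembled in code or which API was used, so the key names, borrowed from Hugging Face text-generation pipelines, are an assumption. Model-specific chat markers (such as `[INST]` or `<|user|>`) are deliberately left out, and only the settings whose names survive in the text are listed.

```python
# Fixed components of the prompt, copied from the examples in the paper.
PREMISE = (
    "Considera le seguenti informazioni relative all'andamento "
    "di un impianto fotovoltaico:"
)
INSTRUCTION = (
    "Riassumi le informazioni che ti ho appena fornito per rispondere "
    "alla seguente domanda:"
)

# Assumed key names (Hugging Face style); values are those reported in Section 2.1.
GENERATION_SETTINGS = {
    "do_sample": True,          # sampling is implied by a non-zero temperature
    "temperature": 0.9,
    "max_new_tokens": 250,      # cap on the response length
    "return_full_text": False,  # do not echo the prompt in the output
}


def build_user_message(readings, user_request):
    """Join premise, one bullet per reading, instruction and user request.

    `readings` is a list of (description, value) pairs; values already carry
    their unit of measure or status, as in the paper's examples.
    """
    lines = [PREMISE]
    lines += [f"- {description}: {value}" for description, value in readings]
    lines += ["", f"{INSTRUCTION} {user_request}"]
    return "\n".join(lines)


# Readings taken from the second appendix example; the request is one of the
# sample requests quoted in Section 2.2.
readings = [
    ("potenza prodotta ora dall'impianto fotovoltaico", "0.0 kW"),
    ("potenza utilizzata ora dalla casa", "0.36 kW"),
    ("stato di carica attuale della batteria", "53%"),
    ("status della batteria", "in scarica"),
]
message = build_user_message(readings, "quanto è carica la batteria?")
print(message)
```

Keeping the fixed and dynamic parts separate makes it easy to vary the readings and the request across the dataset while the premise and instruction stay identical for every model.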