<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Assessing Italian Large Language Models on Energy Feedback Generation: A Human Evaluation Study</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Manuela Sanguinetti</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alessandro Pani</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alessandra Perniciano</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Luca Zedda</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andrea Loddo</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Maurizio Atzori</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Mathematics and Computer Science, University of Cagliari</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This work presents a comparison of some recently released instruction-tuned large language models for the Italian language, focusing in particular on their effectiveness in a specific application scenario, i.e., that of delivering energy feedback. This work is part of a larger project aimed at developing a conversational interface for users of a renewable energy community, where clarity and accuracy of the provided feedback are important for proper energy management. The comparison is based on the human evaluation of the output produced by such models using energy data as input. Specifically, the data pertains to the power flows within a household equipped with a photovoltaic (PV) plant and a battery storage system. The goal of the feedback is to provide the user with such information in a meaningful way, based on the specific aspect they intend to monitor at a given moment (e.g., self-consumption levels, the power generated by the PV panels or imported from the main grid, or the battery state of charge). This evaluation experiment has a two-fold purpose: to provide an exploratory analysis of the models' abilities on this specific generation task, relying solely on the information and instruction provided in the prompt, and to serve as an initial investigation into their potential as reliable tools for generating user-friendly energy feedback in the intended scenario.</p>
      </abstract>
      <kwd-group>
        <kwd>energy feedback</kwd>
        <kwd>large language models</kwd>
        <kwd>Italian</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction and Motivations</title>
      <p>The provision of energy feedback plays a crucial role in promoting energy efficiency among users. The expression energy feedback (or eco-feedback) covers a wide range of energy-related information. This can include detailed reports on energy usage and production (in the case of renewable energy sources), as well as energy-saving advice, whether generic or user-specific. The primary goal of energy feedback is to allow users to make informed decisions regarding their energy management, thus promoting better conservation practices.</p>
      <p>A substantial body of literature within the field of Human-Computer Interaction (HCI) has explored various energy feedback mechanisms, primarily focusing on visual or ambient feedback as well as gamification techniques (we refer to the surveys proposed by Albertarelli et al. [<xref ref-type="bibr" rid="ref1">1</xref>] and Chalal et al. [<xref ref-type="bibr" rid="ref2">2</xref>] for further details on these aspects). However, a greater interest has been reported in the delivery of energy feedback through conversational agents [<xref ref-type="bibr" rid="ref3">3</xref>]. Furthermore, within the field of Natural Language Generation (NLG), several studies prior to the advent of Large Language Models (LLMs) investigated the use of NLG architectures to communicate consumption data. Notable works include those by Trivino and Sanchez-Valdes [<xref ref-type="bibr" rid="ref4">4</xref>] and Conde-Clemente et al. [<xref ref-type="bibr" rid="ref5">5</xref>], which used fuzzy sets to tackle data-to-text generation tasks, also tailoring the linguistic description to given consumption profiles. Similarly, Martínez-Municio et al. [<xref ref-type="bibr" rid="ref6">6</xref>] employed fuzzy sets to produce linguistic summaries based on the consumption of specific buildings or groups of buildings, using time series data as input.</p>
      <p>This work is part of a research project aimed at developing a modular task-oriented conversational agent to inform users about their energy consumption and photovoltaic (PV) production and, more generally, to support better management of their energy resources through text-based energy feedback. The conversational agent will then be deployed and tested within a renewable energy community in Italy, which motivates our specific focus on Italian as the primary language for the interactions. At this stage of the project, we plan to integrate feedback based on actual energy data.</p>
      <p>The main objective of this study is thus to verify how effectively instruction-tuned LLMs currently available for the Italian language can deliver clear and accurate feedback based on energy data provided within a prompt, without relying on more elaborate techniques like fine-tuning or Retrieval Augmented Generation. More specifically, we formulated the following research questions:</p>
      <list list-type="bullet">
        <list-item><p>Are the LLMs under study able to produce energy feedback that is 1) informative, 2) comprehensible, and 3) accurate with respect to the provided energy data?</p></list-item>
        <list-item><p>Are there any major differences among such models with respect to these capabilities?</p></list-item>
      </list>
      <p>To answer these questions, we conducted an exploratory analysis by manually evaluating some of these Italian LLMs, organizing the study around criteria designed to quantify these specific aspects.</p>
      <p>This work closely aligns with a recent initiative that has been launched within the Italian NLP community, i.e., CALAMITA<fn><label>2</label><p>https://clic2024.ilc.cnr.it/calamita/</p></fn>, a campaign aimed at evaluating the capabilities of Italian (or multilingual, but including Italian) LLMs on specific tasks in zero or few-shot settings. Unlike the latter, however, our study relies solely on human judgments rather than automatic metrics. The main challenges of a manual approach include the absence of standardized practices and evaluation criteria [<xref ref-type="bibr" rid="ref7">7</xref>], as well as the lack of systematic documentation [<xref ref-type="bibr" rid="ref8">8</xref>], which hinders the reproducibility of such studies.<fn><label>3</label><p>An attempt in this respect is made within the ReproHum project: https://reprohum.github.io/</p></fn> In light of these challenges, the intended contributions of this paper are outlined below:</p>
      <list list-type="bullet">
        <list-item><p>A small-scale human evaluation of several Italian LLMs on a specific task.</p></list-item>
        <list-item><p>The description of a protocol for human evaluation inspired by the good practices recommended in recent literature [<xref ref-type="bibr" rid="ref10 ref9">9, 10</xref>]. To this end, we also make available the evaluation dataset, with the ratings assigned by the evaluators in a non-aggregated form.<fn><label>4</label><p>https://github.com/msang/nl-interface/tree/main/humEval</p></fn></p></list-item>
      </list>
      <p>The remainder of this paper describes how this study was designed and carried out, with a discussion of the results obtained and the main limitations of the work.</p>
    </sec>
    <sec id="sec-1-1">
      <title>2. Study Design</title>
      <p>As anticipated in the previous section, the main goal of this human evaluation experiment is to assess the overall quality (using specific criteria that will be defined later) of the energy feedback generated by Italian LLMs. The task assigned to the tested models is broadly intended as a summarization task, in that the expected output is supposed to provide a summary of the relevant information available in the prompt. What follows is an overview of the main principles that guided the selection of the models, the development of the dataset used for evaluation, and the whole evaluation protocol.</p>
      <sec>
        <title>2.1. Models and Setting</title>
        <p>The models' selection was primarily driven by the intended application scenario of the overarching project (also mentioned in the previous section), which narrowed down our choice to Italian models. In addition, we opted for open-source models that can be run locally, avoiding the use of APIs. For greater simplicity and practicality, we looked for the Italian models available on HuggingFace, the reference platform for the release of such resources. As a final choice, we exclusively selected instruction-tuned models. These models are trained to follow a wide range of instructions provided in the prompt, offering greater flexibility in handling diverse tasks compared to more specialized fine-tuned models.<fn><label>5</label><p>It is important to note, however, that depending on the task at hand, a prompt (even if supplemented with additional examples) may not be sufficient to obtain good results, and further model refinements might be necessary.</p></fn> This ability makes them particularly suitable for our purposes. In light of this, we selected for our study the following models<fn><label>6</label><p>For simplicity, throughout the paper, only the models' names will be used, without including parameter specifications or additional suffixes.</p></fn>: Cerbero-7B<fn><label>7</label><p>https://huggingface.co/galatolo/cerbero-7b</p></fn> [<xref ref-type="bibr" rid="ref11">11</xref>], LLaMAntino2-7B [<xref ref-type="bibr" rid="ref12">12</xref>], and more specifically the version trained on the UltraChat-ITA dataset<fn><label>8</label><p>https://huggingface.co/swap-uniba/LLaMAntino-2-chat-7b-hf-UltraChat-ITA</p></fn>, LLaMAntino3-8B-ANITA<fn><label>9</label><p>https://huggingface.co/swap-uniba/LLaMAntino-3-ANITA-8B-Inst-DPO-ITA</p></fn> [<xref ref-type="bibr" rid="ref13">13</xref>], and Zefiro-7B<fn><label>10</label><p>https://huggingface.co/giux78/zefiro-7b-beta-ITA-v0.1</p></fn>.</p>
        <p>Regarding the text generation settings, we chose high temperature values to allow the generation of more diverse responses. Specifically, we set both temperature and _ to 0.9 in order to obtain less deterministic and more varied outputs. On the other hand, to ensure a balance between variety and coherence, we kept the _ value low (0.2). After some preliminary tests, we found that these settings provided satisfactory results and could be reasonably used for the actual evaluation phase. As regards the output length, we limited its maximum to 250 tokens to prevent excessively lengthy responses and disabled the option that returns the input prompt as part of the output.</p>
      </sec>
      <sec>
        <title>2.2. Data and Prompts</title>
        <p>The dataset used for evaluation comprises responses from each of the four models tested. These responses were based on an input prompt consisting of two fixed components (the premise and the instruction) and two dynamic elements: user request and information on energy data (see also Figure 1).</p>
        <fig>
          <label>Figure 1</label>
          <caption><p>Pipeline for creating the evaluation dataset used in the models' comparison.</p></caption>
        </fig>
        <p>Regarding the latter, the data available for the experiments can vary and is related to the specific use case of a household equipped with a PV system and a battery storage solution. In this scenario, the PV system can distribute the energy produced to meet user consumption needs, charge the battery, or feed into the main grid. The battery, in turn, can supply power to the user, especially when there is no solar production. The data presented in the prompt describes the energy flow among these different sources and is listed in the form of verbal descriptions, each accompanied by the corresponding data value and unit of measure (or current status if referred to the battery). This data is summarized in Table 1. In order to provide a more realistic depiction of the usage scenario and to introduce a greater variety in the prompt to be processed by the models, the included data encompasses various combinations of values across different aspects (e.g., including greater or lesser household consumption or solar production or different battery charge levels).</p>
        <p>The user requests were randomly sampled from an in-house dataset for intent detection previously developed to train the NLU module of the conversational agent of the main project.<fn><label>11</label><p>The backbone architecture of the agent has been developed using RASA [<xref ref-type="bibr" rid="ref14">14</xref>], and the corpus was originally created to train its built-in classifier, DIET [<xref ref-type="bibr" rid="ref15">15</xref>].</p></fn> The types of user requests used in the evaluation focused on typical monitoring functions. These requests primarily aim to check energy consumption or production data from the PV panels. They may be focused on information such as household energy usage, battery charge status, or current power generation (e.g., <italic>quanto stanno producendo i pannelli?</italic>, EN: "how much are the panels producing?"). Furthermore, requests may require brief and concise responses about a single specific piece of information (<italic>quanto è carica la batteria?</italic>, EN: "how charged is the battery?"), or more comprehensive overviews (<italic>mi serve un quadro completo dei consumi</italic>, EN: "I need a full overview of the consumption").</p>
        <p>The instruction provided in the prompt, aiming to reflect the main intended task, was formulated as follows: "<italic>Riassumi le informazioni che ti ho appena fornito per rispondere alla seguente domanda: [USER_REQUEST]</italic>" (EN: "Summarize the information I have just provided to answer the following question").</p>
        <p>The final dataset for the evaluation phase comprises 50 responses from each model, hence 200 responses overall. The following section provides a detailed description of the evaluation process.</p>
      </sec>
      <sec>
        <title>2.3. Evaluation Protocol</title>
        <p>The actual evaluation phase was preceded by a briefing session and a pilot annotation phase. During the briefing, evaluators discussed the task at hand in order to make sure they fully understood the evaluation criteria and the meaning of the scale values. Following the briefing, a pilot evaluation was carried out. This step allowed evaluators to familiarize themselves with the process and refine their understanding of the evaluation criteria. Once these preparatory steps were completed, they proceeded with the main evaluation task. They worked independently and were not aware of the specific models they were evaluating, to mitigate possible biases deriving from any preconceived notions of the models.</p>
        <p>Four human evaluators, who are co-authors of this paper, conducted the evaluation task. The group comprises three males and one female, each with a background in Computer Science and ranging from graduate students to assistant professors. While all evaluators are familiar with technologies such as conversational agents and possess a good overall understanding of LLMs, their knowledge of concepts related to electricity (e.g., the distinction between power and energy) and renewable energy technologies (such as PV systems and storage solutions) varies from minimal to substantial.</p>
        <p>Evaluators were instructed to assign a Likert-type rating on a 1-7 scale to each model response for each evaluation criterion. The rating scale is anchored with symmetrical verbal labels as follows: 1: Strongly Disagree; 2: Disagree; 3: Mildly Disagree; 4: Neither Agree nor Disagree; 5: Mildly Agree; 6: Agree; 7: Strongly Agree.</p>
        <p>As regards the evaluation criteria, they were designed to address the three dimensions outlined in our first research question: informativeness, comprehensibility, and accuracy. These dimensions represent the factors we deemed essential in the delivery of effective energy feedback; ultimately they are meant to guide the choice of the most suitable model for our intended application scenario. To evaluate informativeness, we drew inspiration from previous work by Mazzei et al. [<xref ref-type="bibr" rid="ref16">16</xref>], considering two complementary aspects: Usefulness, i.e., the extent to which the information provided by the system is useful in responding to the user's request, and Necessity, i.e., the completeness of the information provided, ensuring all necessary details are included. Similarly, to assess the comprehensibility of the models' responses, we considered two criteria: Understandability, i.e., the extent to which the information is presented in an easy-to-understand manner, and Fluency, i.e., the degree to which a text 'flows well'. The third dimension, Accuracy, was evaluated based on the degree to which the content of an output is correct, accurate, and true relative to the input. The definitions of Understandability, Fluency, and Accuracy were drawn from the overview proposed in Howcroft et al. [<xref ref-type="bibr" rid="ref7">7</xref>]. For each of these five criteria, evaluators were asked to assign a rating within the proposed scale, guided by a specific question associated with each criterion (see Table 2).</p>
        <p>To both facilitate the evaluators' work and ensure an accurate rating for each evaluation criterion, each model response was presented alongside the user's request in isolation as well as the entire prompt. This provided them with the full context needed to carry out the task and allowed them to understand the information the model had access to during the response generation. Some examples of prompts, along with the model's output and the evaluation provided by the judges, are reported in Sections A.1-A.2.</p>
      </sec>
    </sec>
    <sec id="sec-1-2">
      <title>3. Results</title>
      <p>Once all judges completed the task, we first measured the Inter-Annotator Agreement using Krippendorff's α.<fn><label>12</label><p>We used the statistical package K-Alpha Calculator [17]: https://www.k-alpha.org/</p></fn> We computed the metric separately for each model and each evaluation criterion. Results are summarized in Table 3, which also shows the average results both per model and criterion.</p>
      <p>The results reveal varying levels of consistency among the evaluators, ranging from moderate to low agreement across all criteria. In particular, Understandability and Fluency exhibit a higher degree of disagreement among the evaluators. This could be due to the subjective nature of these criteria, as different evaluators might give different interpretations of what is considered comprehensible and linguistically fluent. Overall, this variation highlights the probable need for more training for evaluators to improve their consistency, especially in assessing subjective criteria.</p>
      <p>As for the models' comparison, we first aggregated all ratings assigned in order to provide an overview of the models' output across all five evaluation criteria. Since the data is ordinal, we use the median value as an aggregation function to assess the central tendency of the ratings (as also suggested in Amidei et al. [<xref ref-type="bibr" rid="ref9">9</xref>]). The results, shown in Table 4, indicate medium to high ratings overall across all models. To answer our first research question, we examined the overall medians for each evaluation criterion. The values obtained show that the models perform reasonably well despite the variability among them. Concerning the dimension of informativeness, ratings range from 4 to 6 in Usefulness and from 5 to 7 in Necessity, suggesting that further refinements might be necessary to ensure that the energy feedback delivered is useful and complete. In terms of comprehensibility, the corresponding criteria show that all models are capable of generating responses that are easily understandable and fluent, which are both relevant factors that might contribute to a more enjoyable user experience in view of the possible integration of such models in a conversational interface. As regards Accuracy, the energy feedback generated by the models is also generally correct, with only one exception (LLaMAntino2). This indicates that, overall, the models provide accurate and reliable information, another important factor when users have to make informed decisions based on that feedback.</p>
      <p>To answer our second research question, we then considered the overall differences among the models. As also shown in Table 4, LLaMAntino2 quite consistently received lower ratings, particularly for Usefulness and Accuracy, while the other models received high ratings overall, suggesting that they might be considered comparable. To inspect this further, we carried out some statistical tests. We first used the Kruskal-Wallis test, a non-parametric test suitable for ordinal data, to compare the distributions of more than two independent groups. We used it to determine whether the differences among the median values obtained for the models were statistically significant, and the comparisons were carried out separately for each evaluation criterion. This preliminary test confirmed that the differences observed are indeed significant, considering a standard threshold of p &lt; 0.05. However, the Kruskal-Wallis test does not determine which models are significantly different from each other. Therefore, we proceeded with pairwise comparisons using Dunn's test. This test confirmed a significant difference between LLaMAntino2 and the other three models. Table 5 shows the p-values obtained by comparing this model with the other three for each evaluation criterion. The remaining comparisons yielded p-values well above the 0.05 threshold, therefore the null hypothesis cannot be rejected for those cases. The other three models can thus be considered comparable based on the ratings assigned by the evaluators in our experiment.</p>
      <table-wrap>
        <label>Table 5</label>
        <caption><p>P-values obtained with pairwise comparisons between LLaMAntino2 and the remaining models, using Dunn's test, and adjusted using Bonferroni correction.</p></caption>
        <table>
          <thead>
            <tr><th>Criterion</th><th>Cerbero</th><th>LLaMAntino3</th><th>Zefiro</th></tr>
          </thead>
          <tbody>
            <tr><td>Usefulness</td><td>5e-04</td><td>1e-08</td><td>7e-08</td></tr>
            <tr><td>Necessity</td><td>3e-12</td><td>2e-03</td><td>4e-04</td></tr>
            <tr><td>Understandability</td><td>3e-07</td><td>1e-03</td><td>9e-08</td></tr>
            <tr><td>Fluency</td><td>2e-04</td><td>3e-02</td><td>5e-02</td></tr>
            <tr><td>Accuracy</td><td>5e-16</td><td>1e-10</td><td>1e-09</td></tr>
          </tbody>
        </table>
      </table-wrap>
    </sec>
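    <sec>
      <p>The aggregation and correction steps described in the results (median ratings per model and criterion, Bonferroni-adjusted p-values) can be sketched as follows. This is a minimal illustration using only the Python standard library, not the authors' actual analysis code: the function names, the model labels, and all ratings and raw p-values below are hypothetical, and the Kruskal-Wallis and Dunn's tests themselves are not reproduced here.</p>
      <preformat>
```python
from statistics import median


def aggregate_medians(ratings):
    """Return the median rating for every (model, criterion) pair.

    `ratings` is an iterable of (model, criterion, score) tuples, i.e. the
    non-aggregated judgments on the 1-7 scale.
    """
    grouped = {}
    for model, criterion, score in ratings:
        grouped.setdefault((model, criterion), []).append(score)
    return {key: median(scores) for key, scores in grouped.items()}


def bonferroni(p_values):
    """Multiply each raw p-value by the number of comparisons, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# Made-up ratings, for illustration only (not data from the study).
toy_ratings = [
    ("model_a", "Usefulness", 6), ("model_a", "Usefulness", 5),
    ("model_a", "Usefulness", 7), ("model_b", "Usefulness", 3),
    ("model_b", "Usefulness", 4), ("model_b", "Usefulness", 2),
]
medians = aggregate_medians(toy_ratings)

# Made-up raw p-values for three pairwise comparisons.
adjusted = bonferroni([0.01, 0.04, 0.5])
```
      </preformat>
      <p>The median is used because the ratings are ordinal, as stated above; the Bonferroni step simply scales each raw p-value by the number of pairwise comparisons before it is checked against the 0.05 threshold.</p>
    </sec>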
    <sec id="sec-2">
      <title>4. Conclusions and Limitations</title>
      <p>This study provides an initial assessment of several Italian language models' ability to generate effective energy feedback. The results indicate that while the models generally perform well, particularly in terms of comprehensibility and accuracy, there is greater variability regarding informativeness. Among the tested models, results show that, except for LLaMAntino2-7B-UltraChat, the remaining ones provide comparable performances. However, it is important to highlight the limitations of this study. First, this is a small-scale study, as it involves a limited number of models and evaluators. Concerning the former issue, we also point out that the study was restricted to models available on HuggingFace, excluding potentially relevant models from external sources, such as Fauno<fn><label>13</label><p>https://github.com/RSTLess-research/Fauno-Italian-LLM</p></fn> and Camoscio [18]. A more systematic study should consider these models as well, in order to provide a more comprehensive evaluation over the Italian LLMs' landscape. As for the pool of evaluators, it is important to note a significant bias in both their personal backgrounds and demographics. All the judges have a background in computer science and varying degrees of familiarity with the topics at hand. Furthermore, there is a gender imbalance (1 female and 3 male judges) and a lack of age diversity, as all four judges fall within the 24–30 age range. In light of these considerations, a more systematic comparison such as the one envisioned above would benefit from a broader and more diverse pool of evaluators. This would not only increase the reliability of the comparison but also provide a deeper understanding of potential correlations between socio-demographic factors, prior knowledge of technology and energy-related concepts, and the differing perceptions of the evaluation criteria considered in our study. Common approaches to address the lack of human participants include the use of crowdsourcing platforms, with a careful design of participation criteria that would ensure a better gender and demographic balance. Alternatively, a user study involving prospective users of the conversational agent could be conducted; this would ultimately make it possible to gather valuable insights on the type of feedback expected by the target audience.</p>
      <p>Finally, an extended evaluation framework should also include an analysis of the statistical power of the sample size to ensure more robust conclusions.</p>
      <p>Despite these limitations, this work offers a preliminary overview and aims to pave the way for future research on this aspect, also stressing the importance of more standardized human evaluation practices. As a matter of fact, the evaluation protocol we designed draws heavily from methodologies recommended in more general literature pertaining to human evaluation within generation and summarization tasks. Our approach thus aims to ensure that the core principles of the experiment are flexible enough to be easily replicated or adapted for a wider range of different domains.</p>
    </sec>
    <sec>
      <title>Acknowledgments</title>
      <p>This work has been developed within the framework of the project e.INS- Ecosystem of Innovation for Next Generation Sardinia (cod. ECS 00000038) funded by the Italian Ministry for Research and Education (MUR) under the National Recovery and Resilience Plan (NRRP) - MISSION 4 COMPONENT 2, "From research to business" INVESTMENT 1.5, "Creation and strengthening of Ecosystems of innovation" and construction of "Territorial R&amp;D Leaders". This work was also partially funded under the National Recovery and Resilience Plan (NRRP) - Mission 4 Component 2 Investment 1.3, Project code PE0000021, “Network 4 Energy Sustainable Transition–NEST”.</p>
    </sec>
    <sec id="sec-2-1">
      <title>A.1. Example with response generated by LLaMAntino2</title>
      <p><bold>Prompt</bold></p>
      <preformat>&lt;s&gt;[INST] «SYS» Sei un an assistente AI per la lingua italiana. Rispondi nella lingua usata per la domanda in modo chiaro, diretto e completo. Attieniti strettamente alle istruzioni fornite e riporta la tua risposta nel formato richiesto, senza aggiungere ulteriori commenti o spiegazioni. «/SYS»

Considera le seguenti informazioni relative all’andamento di un impianto fotovoltaico:
- potenza prodotta ora dall’impianto fotovoltaico: 0.0 kW
- potenza utilizzata ora dalla casa: 0.23 kW
- stato di carica attuale della batteria: 55%
- status della batteria: in scarica
- potenza immessa in rete: 0.15 kW
- potenza fornita dalla batteria: 0.38 kW
- energia utilizzata dalla casa in tutta la giornata: 14.45 kWh
- energia totale prodotta dai pannelli in tutta la giornata: 36.909 kWh
- energia acquistata dalla rete in tutta la giornata: 0.55 kWh
- auto-consumo della giornata: 13.90 kWh
- energia immessa in rete in tutta la giornata: 23.01 kWh

Riassumi le informazioni che ti ho appena fornito per rispondere alla seguente domanda: voglio avere un resoconto della produzione energetica dei pannelli solari [/INST]</preformat>
      <p><bold>Response</bold> "Mi dispiace, non posso fornire una risposta perché la vostra domanda non è specifica. Per ottenere una risposta precisa, è necessario fornire ulteriori informazioni."</p>
      <p><bold>Evaluation</bold></p>
    </sec>
    <sec id="sec-2-2">
      <title>A.2. Second example</title>
      <p><bold>Prompt</bold></p>
      <preformat>&lt;|system|&gt; Sei un an assistente AI per la lingua italiana. Rispondi nella lingua usata per la domanda in modo chiaro, diretto e completo. Attieniti strettamente alle istruzioni fornite e riporta la tua risposta nel formato richiesto, senza aggiungere ulteriori commenti o spiegazioni.&lt;/s&gt; &lt;|user|&gt; Considera le seguenti informazioni relative all’andamento di un impianto fotovoltaico:
- potenza prodotta ora dall’impianto fotovoltaico: 0.0 kW
- potenza utilizzata ora dalla casa: 0.36 kW
- stato di carica attuale della batteria: 53%
- status della batteria: in scarica
- potenza fornita dalla batteria: 0.32 kW
- potenza acquistata dalla rete: 0.04 kW
- energia utilizzata dalla casa in tutta la giornata: 14.45 kWh
- energia totale prodotta dai pannelli in tutta la giornata: 36.909 kWh
- energia acquistata dalla rete in tutta la giornata: 0.55 kWh
- auto-consumo della giornata: 13.90 kWh
- energia immessa in rete in tutta la giornata: 23.01 kWh

Riassumi le informazioni che ti ho appena fornito per rispondere alla seguente domanda: livello di carica della batteria al momento&lt;/s&gt; &lt;|assistant|&gt;</preformat>
      <p><bold>Response</bold> "La batteria ha un livello di carica del 53% al momento, che indica che ci sono circa 53% di carica nella batteria."</p>
      <p><bold>Evaluation</bold></p>
    </sec>
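    <sec>
      <p>The prompt structure used in the study (two fixed components, the premise and the instruction, plus two dynamic elements, the energy data and the user request) can be sketched roughly as below. This is only an illustrative sketch: the function name and the data layout are hypothetical, treating the sentence that introduces the data as the premise is an assumption, and the model-specific system message and chat-template tokens visible in the appendix examples are deliberately left out.</p>
      <preformat>
```python
# Fixed components, copied from the example prompts (ASCII apostrophe used here).
# Assumption: the sentence introducing the data plays the role of the premise.
PREMISE = ("Considera le seguenti informazioni relative "
           "all'andamento di un impianto fotovoltaico:")
INSTRUCTION = ("Riassumi le informazioni che ti ho appena fornito "
               "per rispondere alla seguente domanda:")


def build_prompt(energy_data, user_request):
    """Assemble premise, energy data lines, instruction and user request.

    `energy_data` is a list of (description, value-with-unit) pairs;
    `user_request` is the sampled user question.
    """
    data_lines = [f"- {label}: {value}" for label, value in energy_data]
    closing = f"{INSTRUCTION} {user_request}"
    return "\n".join([PREMISE, *data_lines, "", closing])


# A few data points taken from the second appendix example.
example_data = [
    ("potenza prodotta ora dall'impianto fotovoltaico", "0.0 kW"),
    ("potenza utilizzata ora dalla casa", "0.36 kW"),
    ("stato di carica attuale della batteria", "53%"),
]
prompt = build_prompt(example_data, "livello di carica della batteria al momento")
```
      </preformat>
      <p>In practice, the resulting text would still need to be wrapped in the chat format expected by each model before generation.</p>
    </sec>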
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>S.</given-names>
            <surname>Albertarelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Fraternali</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Herrera</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Melenhorst</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Novak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Pasini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.-E.</given-names>
            <surname>Rizzoli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Rottondi</surname>
          </string-name>
          ,
          <article-title>A Survey on the Design of Gamified Systems for Energy and Water Sustainability</article-title>
          ,
          <source>Games</source>
          <volume>9</volume>
          (
          <year>2018</year>
          ). doi:10.3390/g9030038.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>M.</given-names>
            <surname>Chalal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Medjdoub</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Bezai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Bull</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zune</surname>
          </string-name>
          ,
          <article-title>Visualisation in Energy Eco-Feedback Systems: A Systematic Review of Good Practice</article-title>
          ,
          <source>Renewable and Sustainable Energy Reviews</source>
          <volume>162</volume>
          (
          <year>2022</year>
          ). doi:10.1016/j.rser.2022.112447.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>M.</given-names>
            <surname>Sanguinetti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Atzori</surname>
          </string-name>
          ,
          <article-title>Conversational Agents for Energy Awareness and Efficiency: A Survey</article-title>
          ,
          <source>Electronics</source>
          <volume>13</volume>
          (
          <year>2024</year>
          ). doi:10.3390/electronics13020401.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>G.</given-names>
            <surname>Trivino</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Sanchez-Valdes</surname>
          </string-name>
          ,
          <article-title>Generation of Linguistic Advices for Saving Energy: Architecture</article-title>
          , in: A.-H. Dediu, L. Magdalena, C. Martín-Vide (Eds.),
          <source>Theory and Practice of Natural Computing</source>
          , Springer International Publishing, Cham,
          <year>2015</year>
          , pp.
          <fpage>83</fpage>
          -
          <lpage>94</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>P.</given-names>
            <surname>Conde-Clemente</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. M.</given-names>
            <surname>Alonso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Trivino</surname>
          </string-name>
          ,
          <article-title>Toward Automatic Generation of Linguistic Advice for Saving Energy at Home</article-title>
          ,
          <source>Soft Computing</source>
          <volume>22</volume>
          (
          <year>2018</year>
          )
          <fpage>345</fpage>
          -
          <lpage>359</lpage>
          . doi:10.1007/s00500-016-2430-5.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>S.</given-names>
            <surname>Martínez-Municio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Rodríguez-Benítez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Castillo-Herrera</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Giralt-Muiña</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Jiménez-Linares</surname>
          </string-name>
          ,
          <article-title>Linguistic Modeling and Synthesis of Heterogeneous Energy Consumption Time Series Sets</article-title>
          ,
          <source>International Journal of Computational Intelligence Systems</source>
          <volume>12</volume>
          (
          <year>2018</year>
          )
          <fpage>259</fpage>
          . doi:10.2991/ijcis.2018.125905639.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>D. M.</given-names>
            <surname>Howcroft</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Belz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.-A.</given-names>
            <surname>Clinciu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Gkatzia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. A.</given-names>
            <surname>Hasan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Mahamood</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Mille</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Van Miltenburg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Santhanam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Rieser</surname>
          </string-name>
          ,
          <article-title>Twenty Years of Confusion in Human Evaluation: NLG Needs Evaluation Sheets and Standardised Definitions</article-title>
          , in:
          <source>Proceedings of the 13th International Conference on Natural Language Generation</source>
          , Association for Computational Linguistics, Dublin, Ireland,
          <year>2020</year>
          , pp.
          <fpage>169</fpage>
          -
          <lpage>182</lpage>
          . doi:10.18653/v1/2020.inlg-1.23.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>A.</given-names>
            <surname>Shimorina</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Belz</surname>
          </string-name>
          ,
          <article-title>The Human Evaluation Datasheet: A Template for Recording Details of Human Evaluation Experiments in NLP</article-title>
          , in: A. Belz, M. Popović, E. Reiter, A. Shimorina (Eds.),
          <source>Proceedings of the 2nd Workshop on Human Evaluation of NLP Systems (HumEval)</source>
          , Association for Computational Linguistics, Dublin, Ireland,
          <year>2022</year>
          , pp.
          <fpage>54</fpage>
          -
          <lpage>75</lpage>
          . doi:10.18653/v1/2022.humeval-1.6.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>G.</given-names>
            <surname>Marzi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Balzano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Marchiori</surname>
          </string-name>
          ,
          <article-title>K-alpha calculator-Krippendorff's alpha calculator: A user-friendly tool for computing Krippendorff's alpha inter-rater reliability coefficient</article-title>
          ,
          <source>MethodsX</source>
          <volume>12</volume>
          (
          <year>2024</year>
          )
          <elocation-id>102545</elocation-id>
          . doi:10.1016/j.mex.2023.102545.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>A.</given-names>
            <surname>Santilli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Rodolà</surname>
          </string-name>
          ,
          <article-title>Camoscio: An Italian Instruction-tuned LLaMA</article-title>
          , in: F. Boschetti, G. E. Lebani, B. Magnini, N. Novielli (Eds.),
          <source>Proceedings of the 9th Italian Conference on Computational Linguistics, Venice, Italy, November 30 - December 2, 2023</source>
          , volume
          <volume>3596</volume>
          of CEUR Workshop Proceedings, CEUR-WS.org,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>J.</given-names>
            <surname>Amidei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Piwek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Willis</surname>
          </string-name>
          ,
          <article-title>The Use of Rating and Likert Scales in Natural Language Generation Human Evaluation Tasks: A Review and some Recommendations</article-title>
          , in:
          <source>Proceedings of the 12th International Conference on Natural Language Generation</source>
          , Association for Computational Linguistics, Tokyo, Japan,
          <year>2019</year>
          , pp.
          <fpage>397</fpage>
          -
          <lpage>402</lpage>
          . doi:10.18653/v1/W19-8648.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>C.</given-names>
            <surname>Van Der Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gatt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Van Miltenburg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Krahmer</surname>
          </string-name>
          ,
          <article-title>Human evaluation of automatically generated text: Current trends and best practice guidelines</article-title>
          ,
          <source>Computer Speech &amp; Language</source>
          <volume>67</volume>
          (
          <year>2021</year>
          )
          <elocation-id>101151</elocation-id>
          . doi:10.1016/j.csl.2020.101151.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>F. A.</given-names>
            <surname>Galatolo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. G. C. A.</given-names>
            <surname>Cimino</surname>
          </string-name>
          ,
          <article-title>Cerbero-7B: A Leap Forward in Language-Specific LLMs Through Enhanced Chat Corpus Generation and Evaluation</article-title>
          ,
          <year>2023</year>
          . arXiv:2311.15698.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>P.</given-names>
            <surname>Basile</surname>
          </string-name>
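          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Musacchio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Polignano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Siciliani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Fiameni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Semeraro</surname>
          </string-name>
          ,
          <article-title>LLaMAntino: LLaMA 2 Models for Effective Text Generation in Italian Language</article-title>
          ,
          <year>2023</year>
          . arXiv:2312.09993.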
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>M.</given-names>
            <surname>Polignano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Basile</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Semeraro</surname>
          </string-name>
          ,
          <article-title>Advanced Natural-based interaction for the ITAlian language: LLaMAntino-3-ANITA</article-title>
          ,
          <year>2024</year>
          . arXiv:2405.07101.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>T.</given-names>
            <surname>Bocklisch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Faulkner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Pawlowski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Nichol</surname>
          </string-name>
          ,
          <article-title>Rasa: Open Source Language Understanding and Dialogue Management</article-title>
          ,
          <source>CoRR</source>
          abs/1712.05181 (
          <year>2017</year>
          ). arXiv:1712.05181.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>T.</given-names>
            <surname>Bunk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Varshneya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Vlasov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Nichol</surname>
          </string-name>
          ,
          <article-title>DIET: Lightweight Language Understanding for Dialogue Systems</article-title>
          ,
          <source>CoRR</source>
          abs/2004.09936 (
          <year>2020</year>
          ). arXiv:2004.09936.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>A.</given-names>
            <surname>Mazzei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Anselma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sanguinetti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rapp</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Mana</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. M.</given-names>
            <surname>Hossain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Patti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Simeoni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Longo</surname>
          </string-name>
          ,
          <article-title>Anticipating User Intentions in Customer Care Dialogue Systems</article-title>
          ,
          <source>IEEE Transactions on Human-Machine Systems</source>
          (
          <year>2022</year>
          ). doi:10.1109/THMS.2022.3184400.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>