<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Ptihsao,r.Italy ating the accountability of LLMs in text simplification
† These authors contributed equally. and on assessing the metrics employed to measure the
$ marco.russodivito@unimol.it (M. Russodivito); quality of LLM text simplicfiation [</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>AI vs. Human: Effectiveness of LLMs in Simplifying Italian Administrative Documents</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Marco Russodivito</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Vittorio Ganfi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giuliana Fiorentino</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Rocco Oliveto</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Molise</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <volume>12</volume>
      <issue>13</issue>
      <fpage>0000</fpage>
      <lpage>0002</lpage>
      <abstract>
        <p>This study investigates the effectiveness of Large Language Models (LLMs) in simplifying Italian administrative texts compared to human informants. This research evaluates the performance of several well-known LLMs, including GPT-3.5-Turbo, GPT-4, LLaMA 3, and Phi 3, in simplifying a corpus of Italian administrative documents (s-ItaIst), a representative sample of Italian administrative texts. To accurately compare the simplification abilities of humans and LLMs, six parallel corpora of a subsection of ItaIst are collected. These parallel corpora were analyzed using both complexity and similarity metrics to assess the outcomes of LLMs and human participants. Our findings indicate that while LLMs perform comparably to humans in many aspects, there are notable differences in structural and semantic changes. The results of our study underscore the potential and limitations of using AI for administrative text simplification, highlighting areas where LLMs need improvement to achieve human-level proficiency.</p>
      </abstract>
      <kwd-group>
        <kwd>Automatic Text Simplification</kwd>
        <kwd>Large Language Models</kwd>
        <kwd>Italian Administrative language</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Due to the increasing popularity of generative
Artificial Intelligence (AI) language tools [
        <xref ref-type="bibr" rid="ref1">1, 2</xref>
        ], significant
attention has been devoted to the use of LLMs for text
simplification [3]. Several studies have addressed the
application of LLMs to simplify texts, particularly focusing
on administrative documents, including those in Italian
[4, 5, 6]. Italian administrative texts are often notably
complex and obscure [7, 8, 9], which restricts a large
segment of the population from fully accessing the content
produced by the Italian public administration [10, 11].
      </p>
      <p>This work aims to (a) evaluate the quality of automatic
text simplification performed by several well-known
LLMs, and (b) compare LLM-based simplification with
human-based simplification. To address these research
questions, the following procedures were undertaken:</p>
      <sec id="sec-1-1">
        <title>1. From an empirical perspective, a large corpus of Italian administrative texts was collected (i.e., ItaIst).</title>
        <p>A parallel simplified counterpart of the corpus was created using different LLMs. Additionally, a shorter version of the administrative corpus was manually simplified by two annotators.</p>
      </sec>
      <sec id="sec-1-2">
        <title>2. From an analytical perspective, several statistical analyses were conducted to measure the semantic and complexity closeness between human and LLM-generated data.</title>
        <p>The comparison of scores for both LLM and human datasets highlights significant differences and similarities between manual and AI-driven simplification.</p>
        <p>The results concerning readability indices (e.g., Gulpease) and semantic and structural similarities (e.g., edit distance) reveal that LLMs generally perform comparably to human informants. However, AI-simplified texts are slightly less similar to the original documents than those generated by human simplifiers. LLMs tend to introduce more changes in the simplified corpora than human annotators. The empirical study indicates that texts simplified by AI exhibit more structural and lexical dissimilarities from the original documents than those simplified by humans.</p>
        <p>Replication package. All the code and data are available on Figshare at https://figshare.com/s/4d927fe648c6f1cb4227.</p>
        <p>… by comparing parallel corpora of simplified documents and adopting a qualitative interpretative approach [17]. Other contributions have assessed the outputs of LLMs in simplification tasks, particularly focusing on models partially trained on Italian [18].</p>
        <p>Our paper analyzes the differences between LLM and human simplification of Italian administrative texts, following a quantitative approach. By examining these differences, our study aims to highlight the similarities and dissimilarities that emerge during the simplification of administrative documents by humans and AI.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>3. Study Design</title>
      <p>Our study aims to analyze the effectiveness of modern LLMs in simplifying administrative text. To achieve this, we address the following Research Question (RQ): How effective are AI systems at simplifying administrative texts compared to humans? This question evaluates whether modern AI can achieve a level of quality comparable to that of human experts, our references, by analyzing how well LLMs can reduce complexity while preserving the original meaning of the texts.</p>
      <p>The study has been conducted on a sub-corpus of ItaIst, utilizing several LLMs to support the text simplification process.</p>
      <p>3.1. Corpus</p>
      <p>The ItaIst corpus has been created as part of the VerbACxSS research project. It was composed by linguists and jurists to create a representative linguistic resource for contemporary administrative Italian [19, 20]. ItaIst was assembled by collecting recent official documents from local and regional public administration websites of eight Italian regions (Basilicata, Calabria, Campania, Lazio, Lombardy, Molise, Tuscany, and Veneto), covering topics such as garbage, healthcare, and public services. The corpus includes a variety of text types, such as Tender Notices, Planning Acts, and Service Charters.</p>
      <p>The reliability of the corpus design was ensured by (a) linguists, who checked that the corpus represents administrative Italian in terms of textual and diatopic features, and (b) jurists, who selected and validated each document included in ItaIst. The resulting corpus, comprising 208 documents, consists of around 2,000,000 tokens and 45,000 types1. More information about the ItaIst corpus can be found in Appendix A.</p>
      <p>To make a fair comparison between humans and AI, a sub-corpus of ItaIst (hereinafter, s-ItaIst) was extracted. The s-ItaIst sub-corpus was composed by selecting representative documents from each region, balancing the topics and text types of the main corpus. Table 1 provides a summary of the s-ItaIst corpus.</p>
      <p>Table 1: An overview of the main metrics of the s-ItaIst corpus. # documents: 8; # sentences: 1,314; # tokens: 33,295; # types: 5,622.</p>
      <p>3.2. LLMs</p>
      <p>To investigate both open-source and commercial models, the s-ItaIst corpus was simplified using four distinct LLMs, namely GPT-3.5-Turbo [21] and GPT-4 [22] by OpenAI, LLaMA 3 [23] by Meta, and Phi 3 [23] by Microsoft. For the open-source models, we used the LLaMA 3 8B2 and Phi 3 3.8B3 variants, both fine-tuned on large Italian corpora. This selection explores models of various sizes while ensuring optimal performance on Italian tasks.</p>
      <p>A detailed prompt was formulated to instruct each model to perform the simplification task properly, avoiding summarization and applying state-of-the-art simplification rules [9]. The full prompt can be found in Appendix B. The OpenAI models were accessed via APIs4, while the open-source models were hosted on an AWS EC2 G65 instance equipped with a single Nvidia L4 GPU with 24GB vRAM.</p>
      <p>3.3. Experimental Procedure</p>
      <p>To address our research question, we conducted an empirical study to compare automatic and manual simplifications. Our study, illustrated in Figure 1, can be summarized in three main steps: (i) constructing a corpus of administrative documents (i.e., s-ItaIst), (ii) simplifying this corpus using four LLMs and two human annotators, and (iii) comparing the LLM-simplified corpora with the human-simplified corpora.</p>
      <p>It is worth noting that the s-ItaIst corpus was subdivided into small sections (2-6 sentences) to avoid exceeding the context windows of the LLMs and to facilitate human informants during simplification6.</p>
      <p>1https://huggingface.co/datasets/VerbACxSS/ItaIst
2https://huggingface.co/DeepMount00/Llama-3-8b-Ita (last seen 07-21-2024)
3https://huggingface.co/e-palmisano/Phi3-ITA-mini-4K-instruct (last seen 07-21-2024)
4https://openai.com/api/ (last seen 07-21-2024)
5https://aws.amazon.com/it/ec2/instance-types/g6/ (last seen 07-21-2024)
6The s-ItaIst corpus was segmented into a total of 619 sections of text. Each section, then, was assigned to human annotators and LLMs for simplification.</p>
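<p>To illustrate this setup, here is a minimal sketch of how a chat-completion request for one corpus section might be assembled. The instruction wording, model name, and temperature below are hypothetical placeholders, not the study's actual configuration; the real prompt is reported in Appendix B.</p>

```python
def build_simplification_request(section: str) -> dict:
    """Chat-completion payload for simplifying one s-ItaIst section.

    The Italian instruction below is a hypothetical stand-in for the
    study's actual prompt (see Appendix B).
    """
    system = (
        "Semplifica il seguente testo amministrativo italiano. "
        "Non riassumere: conserva tutte le informazioni e il significato "
        "originale, applicando regole di semplificazione lessicale e sintattica."
    )
    return {
        "model": "gpt-4",   # the study also ran gpt-3.5-turbo and two HF models
        "temperature": 0,   # assumption: deterministic decoding
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": section},
        ],
    }
```

<p>The same payload shape can be reused for the other models by swapping the model field.</p>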
      <p>Figure 1: Overview of the experimental procedure. The s-ItaIst corpus undergoes manual simplification (Human1, Human2) and automatic simplification (GPT-4, GPT-3.5-Turbo, LLaMA 3, Phi 3), producing six parallel corpora; complexity metrics (Gulpease Index, Flesch-Vacca Index, NVdB %, Passive verbs %) and similarity metrics (Semantic Similarity %, Edit Distance %) are then extracted from them.</p>
      <sec id="sec-2-2">
        <title>3.4. Metrics</title>
        <p>Human annotators with strong backgrounds in linguistics and deep knowledge of administrative text simplification simplified the corpus following common simplification rules identified in the literature [24, 25, 8, 9]. They used a custom web application that (i) assigned sections of the document to simplify and (ii) tracked the time they spent on the activity. Similarly, each LLM was instructed to automatically simplify every document in the corpus one section at a time.</p>
        <p>This approach provided a comprehensive comparison dataset of six distinct parallel corpora. We analyzed these data to compare human and automatic simplifications by extracting features such as complexity and similarity metrics, measuring the quality of the simplified texts and their relatedness to the original text. Furthermore, we computed the Wilcoxon Signed-Rank Test [26] to statistically evaluate the difference between LLM and human metrics, and Cliff's Delta [27, 28] to provide a measure of the effect size.</p>
        <p>In the literature, several simplicity measures (for instance, SAMSA [29] and SARI [30]) are employed, although their results may vary depending on the level of analysis examined and, of course, on the design of the metrics. SAMSA aims to measure structural simplicity by monitoring sentence-splitting accuracy, while SARI was developed to measure the simplicity advantage when just lexical paraphrasing is evaluated. Furthermore, some studies show that, when calculated using multi-operation manual references, both a generic metric like BLEU [31] and an operation-specific one like SARI have low associations with assessments of overall simplicity [32]. Thus, to measure the readability of the investigated corpora we selected: 1. the Flesch Vacca Index, Gulpease Index, and READ-IT, since they are advanced instruments designed to investigate the degree of simplicity of Italian texts, and 2. percentages of some lexical and structural features (i.e., the amount of most common lexical items and active verb forms) that increase the readability of texts.</p>
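<p>Cliff's Delta, used alongside the Wilcoxon Signed-Rank Test (available, for instance, as scipy.stats.wilcoxon), has a direct pairwise definition that can be sketched in a few lines. The magnitude thresholds below follow the common convention from the literature and are an assumption about how effect sizes are labeled here:</p>

```python
def cliffs_delta(xs, ys):
    """Cliff's Delta effect size, in [-1, 1]:
    (#pairs where x exceeds y minus #pairs where y exceeds x) / (n * m)."""
    gt = sum(1 for x in xs for y in ys if x > y)
    lt = sum(1 for x in xs for y in ys if y > x)
    return (gt - lt) / (len(xs) * len(ys))

def magnitude(d):
    """Label |d| using the common thresholds (0.147, 0.33, 0.474)."""
    a = abs(d)
    if a >= 0.474:
        return "large"
    if a >= 0.33:
        return "medium"
    if a >= 0.147:
        return "small"
    return "negligible"
```
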
        <p>To assess the quality of the simplifications, we employed
both complexity and similarity metrics from the
literature. Complexity metrics compare the ease of the original
and simplified text, while similarity metrics measure the
distance between them. We implemented these metrics
according to the state-of-the-art, leveraging natural
language processing (NLP) techniques (e.g., tokenization,
POS tagging7).</p>
        <sec id="sec-2-2-1">
          <p>7The process of tokenization and tagging was conducted using the spaCy natural language processing tool: https://spacy.io (last seen 07-21-2024).</p>
          <p>Also for similarity metrics, the computational literature offers several resources aiming to measure the structural or semantic proximity of texts. Some of these operate at the level of n-gram overlap (e.g., BLEU [31] and METEOR [33]), while others consider other features. For this analysis, we selected Semantic Similarity to quantify the degree of semantic closeness between corpora and Edit distance to measure structural similarities between the investigated corpora.</p>
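<p>Both selected similarity metrics reduce to short computations once texts are embedded or aligned; a minimal stdlib sketch follows. In the study the embedding vectors come from a multilingual sentence-transformers model; here they are represented as plain lists of floats, an assumption made to keep the sketch self-contained:</p>

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors
    (in the study, sentence embeddings; here, plain float lists)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def edit_distance_pct(a: str, b: str) -> float:
    """Levenshtein distance between two texts, normalized by the length
    of the longer one and expressed as a percentage (0 = identical)."""
    m, n = len(a), len(b)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution
        prev = curr
    return 100 * prev[n] / max(m, n, 1)
```
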
          <p>To support future research, we have made our metrics
implementation publicly available8.</p>
          <p>Details concerning the considered complexity metrics are shown herein:
• Gulpease Index [34]: This metric evaluates the readability of an Italian text and assesses the education level required to fully comprehend it. It is calculated using the following formula: 89 + (300 · #sentences − 10 · #letters) / #words (1).
• Flesch-Vacca Index: The Italian adaptation of the Flesch Reading Ease formula, calculated as 217 − 1.3 · (#words / #sentences) − 0.6 · (#syllables · 100 / #words) (2).
• READ-IT [36]: The tool is the first advanced readability evaluation instrument for Italian, combining traditional raw text features with lexical, morpho-syntactic, and syntactic information. Four different readability models are included in the tool: READ-IT BASE includes only raw features, calculating sentence length (average number of words per sentence) and word length (average number of characters per word); READ-IT LEXICAL combines raw (e.g., word length) and lexical (e.g., Type/Token Ratio) features; READ-IT SYNTACTIC employs raw text (e.g., sentence length) and morpho-syntactic (e.g., average number of clauses per sentence) properties; READ-IT GLOBAL includes all the other features, combining raw text, lexical, morpho-syntactic and syntactic (e.g., the depth of the whole parse tree) features9.
• NVdB (%): "Il Nuovo vocabolario di base della lingua italiana" [37] consists of fundamental and commonly used words representing the essential lexicon of the Italian language. The ease of a text can be roughly estimated by the number of words listed in the basic vocabulary [38].
• Passive (%): Overuse of passive voice can lead to ambiguity and complexity, especially for readers who may struggle with comprehension [24, 25, 9]. It is calculated by identifying verbs with aux:pass occurring in the Dependency Parsing Tree.</p>
          <p>Details concerning the considered similarity metrics are shown herein:
• Semantic Similarity (%) [39]: This metric measures the distance between the semantic meanings of two documents. It can be computed exploiting relevant methodologies from the literature, such as BERTscore [40] and SBERT [41]. We opted for the latter approach, which leverages cosine similarity between contextual embeddings (obtained through sentence-transformers and an open-source multilingual model10) to evaluate similarity at the sentence level, encapsulating the overall contextual meaning [42].
• Edit distance (%) [43]: This metric measures the similarity between two strings based on the number of single-character edits (insertions, deletions, or substitutions) required to transform one text into the other. A value close to zero indicates a relatively minor difference between the two texts, while a high value indicates significant rephrasing.</p>
        </sec>
        <sec id="sec-2-2-2">
          <title>3.5. Threats to validity</title>
          <p>We analyze the validity of our study by examining construct, internal, and external validity. This evaluation helps us understand the strengths and limitations of our methodology and the generalizability of our findings.</p>
          <p>Construct validity: The two linguistic experts involved in the manual simplification of the s-ItaIst corpus may have produced divergent variants due to their subjective approaches. Despite differences in seniority, both experts have strong linguistic backgrounds (holding PhDs) and several years of experience. Nevertheless, involving two human simplifiers allowed us to explore distinct simplification approaches and compare automatic simplification against two varied benchmarks.</p>
          <p>Internal validity: The LLMs used for automatic text simplification, particularly those from HuggingFace, may have been trained on non-administrative texts, potentially introducing issues in the simplified text. However, we relied on state-of-the-art models tested against several benchmarks [44, 45, 46, 47]. Additionally, the embeddings for calculating Semantic Similarity were obtained through a multilingual model chosen for its high ranking on the MTEB leaderboard11, particularly for its performance in the STS22 benchmark (it) [48].</p>
          <p>External validity: Our study focuses on the sub-corpus of ItaIst, consisting of eight administrative documents. Although the number of documents is relatively small, the corpus includes over 1,000 sentences. Manual simplification of the corpus took Human1 and Human2 15 and 23 hours, respectively. Extending our study to the entire ItaIst corpus would have been infeasible. However, the documents of the ItaIst sub-corpus were not chosen randomly; they were selected to represent the variety of administrative texts.</p>
        </sec>
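<p>The Gulpease Index described in Section 3.4 depends only on letter, word, and sentence counts; a rough stdlib sketch follows. The counting heuristics below, especially the regex-based sentence splitting, are simplifications of what a full NLP pipeline such as spaCy provides:</p>

```python
import re

def gulpease(text: str) -> float:
    """Gulpease readability index:
    89 + (300 * #sentences - 10 * #letters) / #words.
    Higher values indicate easier text."""
    words = re.findall(r"[A-Za-zÀ-ÿ]+", text)
    letters = sum(len(w) for w in words)
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    n_words = max(1, len(words))
    return 89 + (300 * sentences - 10 * letters) / n_words
```
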
      </sec>
    </sec>
    <sec id="sec-3">
      <title>4. Results and Discussion</title>
      <sec id="sec-3-1">
        <p>A preliminary analysis of our results, summarized in Table 2, reveals several significant similarities and differences between the human and LLM datasets. For instance, the variation in the number of tokens is similar across both human and LLM corpora, although LLMs generally increase the number of sentences more prominently than human annotators.</p>
        <p>Regarding complexity metrics, all the parallel corpora (both human and LLM) exhibit a general increase in readability compared to the original texts. For example, the majority of the corpora improve the Gulpease Index readability metric, shifting the difficulty level from very difficult to difficult for middle school reading levels [34] (except for Human1 and GPT-3.5-Turbo). Additionally, complexity metrics vary similarly across both human and LLM groups, with differences between manual and AI simplifiers not significantly greater than those between Human1 and Human2 or among GPT-3.5-Turbo, GPT-4, LLaMA 3, and Phi 3.</p>
        <p>The analysis of semantic and structural distance metrics from the original s-ItaIst shows more pronounced differences between human and LLM datasets. In terms of semantic similarity (Semantic Similarity), the Human1 and Human2 corpora are closer to the original meaning than the LLM-simplified corpora. These differences are even more pronounced when considering edit distance (Edit distance). The percentage of edit distance is higher in the LLM group, with each LLM corpus exceeding the human ones by at least 10%.</p>
        <p>Higher degrees of Semantic Similarity and lower degrees of Edit distance in human corpora indicate that human annotators tend to make fewer changes to the original text compared to LLMs.</p>
        <p>As reported in Table 2, GPT-4 achieved the best results across the majority of metrics (except for READ-IT LEXICAL). To validate our outcomes, we performed the Wilcoxon Signed-Rank Test and calculated Cliff's Delta effect size to analyze the difference between GPT-4 and human metrics. By examining the results in Table 3, we can assert that: GPT-4 simplifications can be comparable to human simplifications. GPT-4 simplifications are negligibly better for complexity metrics, moderately worse for similarity, and largely rephrased compared to human simplifications. The results of the Wilcoxon Signed-Rank Test and Cliff's Delta effect size for the other models, though not fully significant, are listed in Appendix C.</p>
        <p>A brief extract taken from the Original, Human1, Human2, and GPT-4 parallel corpora, representing the same phrase simplified by the two human annotators and GPT-4, is shown below12:
Original: fatturato minimo annuo, per gli ultimi tre esercizi, pari o superiore al valore stimato del presente appalto
Human1: Guadagno in un anno (fatturato minimo annuo) negli ultimi 3 anni di valore uguale o superiore al valore di questo bando
Human2: l'ammontare di fatture emesse annualmente, per gli ultimi tre anni, deve essere pari o superiore al valore stimato del presente appalto
GPT-4: un fatturato annuo minimo, negli ultimi tre anni, uguale o maggiore al valore stimato dell'appalto
12A more extensive example of data regarding human and LLM simplifications collected in the parallel corpora designed for this study can be found in Appendix D.</p>
        <p>Table 3: Results of the Wilcoxon Signed-Rank Test and Cliff's Delta effect size performed on GPT-4, Human1, and Human2 metrics.
Human1. Gulpease Index: p &lt; 0.0001, negligible ↗; Flesch Vacca Index: p &lt; 0.0001, negligible ↗; NVdB: p = 0.0108, negligible ↗; Passive: p = 0.0004, negligible ↘; READ-IT BASE: p &lt; 0.0001, small ↘; READ-IT LEXICAL: p &lt; 0.0001, negligible ↗; READ-IT SYNTACTIC: p &lt; 0.0001, small ↘; READ-IT GLOBAL: —; Semantic Similarity: —; Edit distance: —.
Human2. Gulpease Index: —; Flesch Vacca Index: —; NVdB: —; Passive: p &lt; 0.0001, negligible ↘; READ-IT BASE: p = 0.0292, negligible ↗; READ-IT LEXICAL: —; READ-IT SYNTACTIC: p &lt; 0.0001, negligible ↘; READ-IT GLOBAL: p &lt; 0.0001, negligible ↘; Semantic Similarity: p &lt; 0.0001, medium ↘; Edit distance: p &lt; 0.0001, large ↗.</p>
        <p>Despite this limitation, LLMs can serve as valuable support tools for text simplification, significantly accelerating a process that typically requires hours of manual work. By generating initial drafts, LLMs can reduce the workload of human experts, who would then review and refine the AI-generated drafts, ensuring the preservation of the overall meaning and legal integrity of the text. The results achieved in our study indicated that modern LLMs can simplify administrative documents almost as effectively as humans. However, the achieved findings […] automatically simplified documents. A manual investigation of our parallel corpus, supervised by expert jurists, may reveal important implications in this sensitive context.</p>
        <p>Another promising direction for future research is to investigate the impact of automatic simplification on text comprehension. An additional empirical study could be designed to evaluate whether automatically simplified documents are easier to understand than their original versions.</p>
        <p>In the above syntagmas, the similarities between the simplifications are quite obvious: for example, the technical term esercizio or the more ambiguous word pari are replaced by the more common lexical equivalents anno or uguale, respectively.</p>
        <p>Additionally, it would be worthwhile to explore different prompting strategies to further improve simplification quality. For instance, few-shot prompting [50] with some manually simplified gold samples could better align LLMs with human style.</p>
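<p>Few-shot prompting of this kind can be sketched by interleaving gold human simplifications as prior chat turns. The system instruction and example pairs below are hypothetical placeholders, not the study's actual prompt or data:</p>

```python
def build_few_shot_messages(gold_pairs, section):
    """Chat messages with manually simplified gold samples as examples.

    gold_pairs: list of (original, simplified) tuples; hypothetical
    stand-ins for manually simplified s-ItaIst sections.
    """
    messages = [{
        "role": "system",
        "content": "Semplifica il testo amministrativo italiano senza riassumerlo.",
    }]
    for original, simplified in gold_pairs:
        # each gold pair becomes a worked example in the conversation
        messages.append({"role": "user", "content": original})
        messages.append({"role": "assistant", "content": simplified})
    messages.append({"role": "user", "content": section})
    return messages
```
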
      </sec>
    </sec>
    <sec id="sec-4">
      <title>5. Conclusion</title>
      <sec id="sec-4-1">
        <title>In this study, we investigated the automatic simplification of Italian administrative documents.</title>
        <p>Our results demonstrate that LLMs can effectively simplify these texts, performing comparably to humans13.</p>
        <p>Among the models examined, GPT-4 shows superior performance in text simplification, exhibiting significant improvements in complexity metrics. Nonetheless, it is noteworthy that humans tend to maintain a higher Semantic Similarity and a lower Edit distance, ensuring the preservation of the original meaning and structure of the text. In other words, humans, aware of the importance of precise language for these documents, mostly preserved the original meaning and structure, whereas LLMs, while simplifying, tended to rephrase extensively. This rephrasing, although effective in reducing complexity, might inadvertently alter the legal nuances, which are critical in administrative texts.
13Further evidence showing that LLM simplifications preserve the meaning of the original texts was obtained in a study conducted on the same data. The unpublished research indicated that experienced evaluators, i.e., jurists having administrative competence, agree that LLM simplifications of administrative texts maintain the legal integrity of the original documents [49].</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <sec id="sec-5-1">
        <title>This contribution is a result of the research conducted within the framework of the PRIN 2020 (Progetti di Rilevante Interesse Nazionale) project “VerbACxSS: on analytic verbs, complexity, synthetic verbs, and simplification. For accessibility” (Prot. 2020BJKB9M), funded by the Italian Ministry of Universities and Research.</title>
        <p>Giuliana Fiorentino and Rocco Oliveto are responsible for research question identification, study design, research supervision, and data analysis. However, for academic reasons, Section 2, Section 3.1, Section 3.3, Section 4, and Section 5 are attributed to Vittorio Ganfi; and Section 1, Section 3, Section 3.2, Section 3.4, and Section 3.5 to Marco Russodivito.</p>
        <p>References
[2] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, J. Davison, S. Shleifer, P. von Platen, C. Ma, Y. Jernite, J. Plu, C. Xu, T. Le Scao, S. Gugger, M. Drame, Q. Lhoest, A. Rush, Transformers: State-of-the-art natural language processing, in: Conference on Empirical Methods in Natural Language Processing: System Demonstrations (EMNLP), 2020, pp. 38–45.
[3] M. J. Ryan, T. Naous, W. Xu, Revisiting non-English text simplification: A unified multilingual benchmark, Association for Computational Linguistics (ACL) (2023).
[4] D. Brunato, F. Dell'Orletta, G. Venturi, S. Montemagni, Design and Annotation of the First Italian Corpus for Text Simplification, in: Linguistic Annotation Workshop (LAW), 2015, pp. 31–41.
[5] M. Miliani, S. Auriemma, F. Alva-Manchego, A. Lenci, Neural readability pairwise ranking for sentences in Italian administrative language, in: Asia-Pacific Chapter of the Association for Computational Linguistics (AACL) and International Joint Conference on Natural Language Processing (IJCNLP), 2022, pp. 849–866.
[6] M. Miliani, M. S. Senaldi, G. Lebani, A. Lenci, Understanding Italian Administrative Texts: A Reader-Oriented Study for Readability Assessment and Text Simplification, in: Workshop on AI for Public Administration (AIxPA), 2022, pp. 71–87.
[7] S. Lubello, La lingua del diritto e dell'amministrazione, Il mulino, Bologna, 2017.
[8] M. Cortelazzo, Il linguaggio amministrativo. Principi e pratiche di modernizzazione, Carocci, Roma, 2021.
[9] G. Fiorentino, V. Ganfi, Parametri per semplificare l'italiano istituzionale: Revisione della letteratura, Italiano LinguaDue 16 (2024) 220–237.
[10] E. Piemontese (Ed.), Il dovere costituzionale di farsi capire. A trent'anni dal Codice di stile, Carocci, Roma, 2023.
[11] S. Lubello, Da dembsher al codice di stile e oltre: un bilancio sul linguaggio burocratico, in: E. Piemontese (Ed.), Il dovere costituzionale di farsi capire. A trent'anni dal Codice di stile, Carocci, Roma, 2023, pp. 54–70.
[12] G. Gonzalez Delgado, B. Navarro Colorado, The Simplification of the Language of Public Administration: The Case of Ombudsman Institutions, in: Proceedings of the Workshop on DeTermIt! Evaluating Text Difficulty in a Multilingual Context, 2024, pp. 125–133.
[13] R. Doshi, K. Amin, P. Khosla, S. Bajaj, S. Chheang, H. P. Forman, Utilizing large Language Models to Simplify Radiology Reports: a comparative analysis of ChatGPT3.5, ChatGPT4.0, Google Bard, and Microsoft Bing, medRxiv (2023). doi:10.1101/2023.06.04.23290786.
[14] P. Mavrepis, G. Makridis, G. Fatouros, V. Koukos, M. Separdani, D. Kyriazis, Xai for all: Can large language models simplify explainable ai?, arXiv preprint arXiv:2401.13110 (2024).
[15] Y. Ma, S. Seneviratne, E. Daskalaki, Improving Text Simplification with Factuality Error Detection, in: Workshop on Text Simplification, Accessibility, and Readability (TSAR), 2022, pp. 173–178.
[16] F. Alva-Manchego, C. Scarton, L. Specia, Data-Driven Sentence Simplification: Survey and Benchmark, Computational Linguistics 46 (2020) 135–187.
[17] M. Miliani, F. Alva-Manchego, A. Lenci, Simplifying Administrative Texts for Italian L2 Readers with Controllable Transformers Models: A Data-driven Approach, in: CLiC-it, 2023.
[18] D. Nozza, G. Attanasio, et al., Is it really that simple? Prompting language models for automatic text simplification in Italian, in: CEUR Workshop Proceedings, 2023.
[19] D. Vellutino, et al., L'italiano istituzionale per la comunicazione pubblica, Il mulino, Bologna, 2018.
[20] D. Vellutino, N. Cirillo, Corpus «itaist»: Note per lo sviluppo di una risorsa linguistica per lo studio dell'italiano istituzionale per il diritto di accesso civico, Italiano LinguaDue 16 (2024) 238–250.
[21] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al., Language models are few-shot learners, Advances in Neural Information Processing Systems (NIPS) 33 (2020) 1877–1901.
[22] J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat, et al., Gpt-4 technical report, arXiv preprint arXiv:2303.08774 (2023).
[23] AI@Meta, Llama 3 model card (2024). URL: https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md.
[24] E. Piemontese, Criteri e proposte di semplificazione, in: Codice di stile delle comunicazioni scritte a uso delle pubbliche amministrazioni, Istituto Poligrafico e Zecca dello Stato, Roma, 1994.
[25] A. Fioritto, Manuale di stile. Strumenti per semplificare il linguaggio delle amministrazioni pubbliche, Il mulino, Bologna, 1997.
[26] F. Wilcoxon, Probability tables for individual comparisons by ranking methods, Biometrics 3 (1947) 119–122.
[27] N. Cliff, Dominance statistics: Ordinal analyses to answer ordinal questions, Psychological Bulletin 114 (1993) 494–509.
[28] N. Cliff, Ordinal methods for behavioral data analysis, Psychology Press, New York, 2014.
[29] E. Sulem, O. Abend, A. Rappoport, Semantic structural evaluation for text simplification, in: M. Walker, H. Ji, A. Stent (Eds.), Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), Association for Computational Linguistics, New Orleans, Louisiana, 2018, pp. 685–696. URL: https://aclanthology.org/N18-1063. doi:10.18653/v1/N18-1063.
[30] W. Xu, C. Napoles, E. Pavlick, Q. Chen, C. Callison-Burch, Optimizing Statistical Machine Translation for Text Simplification, Transactions of the Association for Computational Linguistics 4 (2016) 401–415. URL: https://doi.org/10.1162/tacl_a_00107. doi:10.1162/tacl_a_00107.
[31] K. Papineni, S. Roukos, T. Ward, W.-J. Zhu, Bleu: a method for automatic evaluation of machine translation, in: Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, ACL '02, Association for Computational Linguis-
[…] doi:10.3389/fpsyg.2022.707630.
[39] D. Chandrasekaran, V. Mago, Evolution of semantic similarity—A survey, ACM Computing Surveys (CSUR) 54 (2021) 1–37.
[40] T. Zhang, V. Kishore, F. Wu, K. Q. Weinberger, Y. Artzi, Bertscore: Evaluating text generation with bert, in: International Conference on Learning Representations, 2020. URL: https://openreview.net/forum?id=SkeHuCVFDr.
[41] N. Reimers, I. Gurevych, Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks, in: Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics, 2019.
[42] A. Barayan, J. Camacho-Collados, F. Alva-Manchego, Analysing zero-shot readability-controlled sentence simplification, arXiv preprint arXiv:2409.20246 (2024).
[43] F. P. Miller, A. F. Vandome, J. McBrewster, Levenshtein distance: Information theory, computer
tics, USA, 2002, p. 311–318. URL: https://doi.org/ science, string (computer science), string metric,
10.3115/1073083.1073135. doi:10.3115/1073083. damerau? Levenshtein distance, spell checker,
ham1073135. ming distance, Alpha Press, Olando, 2009.
[32] F. Alva-Manchego, C. Scarton, L. Specia, The [44] D. Hendrycks, C. Burns, S. Basart, A. Zou,
(Un)Suitability of Automatic Evaluation Metrics for M. Mazeika, D. Song, J. Steinhardt, Measuring
Text Simplification, Computational Linguistics 47 massive multitask language understanding,
Inter(2021) 861–889. URL: https://doi.org/10.1162/coli_ national Conference on Learning Representations
a_00418. doi:10.1162/coli_a_00418. (ICLR) (2021).
[33] S. Banerjee, A. Lavie, Meteor: An automatic metric [45] R. Zellers, A. Holtzman, Y. Bisk, A. Farhadi, Y. Choi,
for mt evaluation with improved correlation with Hellaswag: Can a machine really finish your
senhuman judgments, in: Workshop on Intrinsic and tence?, in: Proceedings of the 57th Annual Meeting
Extrinsic Evaluation Measures for Machine Trans- of the Association for Computational Linguistics,
lation and/or Summarization, 2005, pp. 65–72. 2019, p. 4791–4800.
[34] P. Lucisano, M. E. Piemontese, Gulpease: una for- [46] P. Clark, I. Cowhey, O. Etzioni, T. Khot, A.
Sabmula per la predizione della leggibilita di testi in harwal, C. Schoenick, O. Tafjord, Think you have
lingua italiana, Scuola e città (1988) 110–124. solved question answering? try arc, the ai2
rea[35] V. Franchina, R. Vacca, Adaptation of flesh readabil- soning challenge, arXiv preprint arXiv:1803.05457
ity index on a bilingual text written by the same (2018).
author both in italian and english languages, Lin- [47] D. Dua, Y. Wang, P. Dasigi, G. Stanovsky, S. Singh,
guaggi 3 (1986) 47–49. M. Gardner, Drop: A reading comprehension
bench[36] F. Dell’Orletta, S. Montemagni, G. Venturi, Read–it: mark requiring discrete reasoning over paragraphs,
Assessing readability of italian texts with a view to in: J. Burstein, C. Doran, T. Solorio (Eds.),
Proceedtext simplification, in: Proceedings of the second ings of the 2019 Conference of the North American
workshop on speech and language processing for Chapter of the Association for Computational
Linassistive technologies, 2011, pp. 73–83. guistics: Human Language Technologies, Volume 1
[37] T. De Mauro, I. Chiari, Il nuovo vo- (Long and Short Papers), 2019, pp. 2368–2378.
cabolario di base della lingua italiana [48] N. Muennighof, N. Tazi, L. Magne, N. Reimers,
(2016). URL: https://www.internazionale. MTEB: Massive text embedding benchmark, in:
it/opinione/tullio-de-mauro/2016/12/23/ European Chapter of the Association for
Computail-nuovo-vocabolario-di-base-della-lingua-italiana. tional Linguistics (EACL), 2023, pp. 2014–2037.
[38] D. Brunato, F. Dell’Orletta, G. Venturi, [49] G. Fiorentino, M. Russodivito, V. Ganfi, R. Oliveto,
Linguistically-Based Comparison of Difer- Validazione e confronto tra semplificazione
autoent Approaches to Building Corpora for matica e semplificazione manuale di testi in italiano
Text Simplification: A Case Study on Ital- istituzionale ai fini dell’eficacia comunicativa, in:
ian, Frontiers in Psychology 13 (2022). Automated texts In the ROMance languages and
beadvances of few-shot learning methods and
applications, Science China Technological Sciences 66
(2023) 920–944.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>A. Corpus ItaIst</title>
      <p>The ItaIst corpus is a comprehensive collection of Italian
administrative documents. Table 4 provides an overview
of the topics and regions from which these documents
were collected. This corpus has been assembled to
represent the diversity and complexity of contemporary
administrative Italian, ensuring its relevance for linguistic
and computational analysis.</p>
    </sec>
    <sec id="sec-7">
      <title>B. Prompt engineering</title>
      <sec id="sec-7-1">
        <p>In the context of LLMs, the term prompt refers to the instructions provided to a language model to generate a specific response. Prompt engineering is the process of designing a clear and detailed prompt that instructs the model to generate the desired response. The prompt we used to ask the models to simplify administrative text is:</p>
        <p>Sei un dipendente pubblico che deve scrivere dei documenti istituzionali italiani per renderli semplici e comprensibili per i cittadini. Ti verrà fornito un documento pubblico e il tuo compito sarà quello di riscriverlo applicando regole di semplificazione senza però modificare il significato del documento originale. Ad esempio potresti rendere le frasi più brevi, eliminare le perifrasi, esplicitare sempre il soggetto, utilizzare parole più semplici, trasformare i verbi passivi in verbi di forma attiva, spostare le frasi parentetiche alla fine del periodo.</p>
        <p>(In English: You are a public employee who has to write Italian institutional documents so that they are simple and understandable for citizens. You will be given a public document, and your task will be to rewrite it by applying simplification rules without changing the meaning of the original document. For example, you could make the sentences shorter, remove circumlocutions, always make the subject explicit, use simpler words, turn passive verbs into active ones, and move parenthetical clauses to the end of the sentence.)</p>
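        <p>As a sketch of how a fixed system prompt of this kind can be paired with each document before it is sent to a chat-based LLM; the message-role layout below follows the common OpenAI-style chat format and is an assumption for illustration, not the paper's exact setup:</p>
        <preformat>
```python
# Sketch: assembling the simplification request for a chat-style LLM.
# The system message carries the (Italian) simplification instructions;
# the user message carries the administrative document to simplify.
# The role layout is an assumption, not the study's documented pipeline.

SIMPLIFICATION_PROMPT = (
    "Sei un dipendente pubblico che deve scrivere dei documenti "
    "istituzionali italiani per renderli semplici e comprensibili "
    "per i cittadini. ..."  # full prompt as given above
)

def build_simplification_messages(document: str) -> list[dict]:
    """Return an OpenAI-style message list for one document."""
    return [
        {"role": "system", "content": SIMPLIFICATION_PROMPT},
        {"role": "user", "content": document},
    ]

messages = build_simplification_messages("L'operatore di Polizia Locale ...")
```
        </preformat>
        <p>Keeping the instructions in the system message and only the document in the user message makes it easy to reuse one prompt across all models and documents.</p>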
      </sec>
    </sec>
    <sec id="sec-9">
      <title>C. Tests</title>
      <p>Table 5, Table 6, and Table 7 report the results of the statistical analyses conducted to compare the simplification performance of various LLMs against human experts. The Wilcoxon Signed-Rank Test and Cliff's Delta effect size were employed to evaluate the metrics of the GPT-3.5-Turbo, LLaMA 3, and Phi 3 models in comparison to two human simplifiers, labelled Human1 and Human2. These analyses provide insights into the relative effectiveness of AI-driven simplifications versus human efforts.</p>
    </sec>
    <sec id="sec-8">
      <title>D. Examples</title>
      <p>Table 8 provides several examples of text simplification. For each example, we present the original text alongside its simplified versions. The values of the complexity and similarity metrics are reported for each text.</p>
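      <p>The Cliff's Delta effect size used in the statistical tests of Appendix C can be sketched in a few lines (the Wilcoxon signed-rank test itself is usually taken from a library such as scipy.stats.wilcoxon); the scores below are illustrative numbers, not the paper's data:</p>
      <preformat>
```python
# Sketch of Cliff's delta, the ordinal effect size reported alongside the
# Wilcoxon signed-rank test. Values range from -1 to 1; 0 means the two
# groups' values fully overlap.

def cliffs_delta(xs: list, ys: list) -> float:
    """delta = (#pairs where x wins - #pairs where y wins) / (n * m)."""
    x_wins = sum(1 for x in xs for y in ys if x > y)
    y_wins = sum(1 for x in xs for y in ys if y > x)
    return (x_wins - y_wins) / (len(xs) * len(ys))

# Toy comparison: Gulpease scores of one model vs. one human simplifier
# (illustrative values only). Every human score exceeds every model score,
# so the delta is -1.0 (complete separation in the human's favour).
model_scores = [48, 45, 50, 52]
human_scores = [55, 58, 53, 57]
print(cliffs_delta(model_scores, human_scores))  # prints -1.0
```
      </preformat>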
      <p>Table 8 – Example simplifications with complexity and similarity metrics (Gulpease Index; Flesch Vacca Index; NVdB %; Passive %; Semantic Similarity %; Edit distance %).</p>
      <p>Original: L'operatore di Polizia Locale, quindi, rappresenta un importante punto di riferimento per la collettività. Nell'ambito delle sue funzioni, esso svolge i propri compiti in maniera autorevole, dando prova di preparazione professionale e sensibilità nel contatto relazionale. La sua attività, inoltre, è caratterizzata dal costante sforzo teso alla migliore interpretazione delle situazioni e delle problematiche incontrate, applicando un approccio educativo e orientato alla più adeguata risposta ai bisogni della cittadinanza.</p>
      <p>Gulpease Index: 38; Flesch Vacca Index: 12; NVdB: 77 %; Passive: 28 %; Semantic Similarity: –; Edit distance: –</p>
      <p>Human 1: La Polizia Locale è un punto di riferimento per i cittadini. La Polizia Locale ha autorevolezza, professionalità e sensibilità nel contatto con i cittadini. La Polizia Locale cerca sempre di interpretare al meglio situazioni e problematiche incontrate. La Polizia Locale si comporta in modo da educare e rispondere adeguatamente ai bisogni dei cittadini.</p>
      <p>Gulpease Index: 55; Flesch Vacca Index: 33; NVdB: 67 %; Passive: 0 %; Semantic Similarity: 93 %; Edit distance: 56 %</p>
      <p>Human 2: L'operatore di Polizia Locale, quindi, è un importante punto di riferimento per la collettività. Quando è in servizio, esso svolge i propri compiti in maniera autorevole. Dimostra preparazione professionale e sensibilità nel contatto con gli utenti. Cerca sempre di interpretare le situazioni e i problemi nel migliore dei modi. Applica un approccio educativo e cerca di adeguarsi ai bisogni della cittadinanza.</p>
      <p>Gulpease Index: 58; Flesch Vacca Index: 42; NVdB: 83 %; Passive: 0 %; Semantic Similarity: 98 %; Edit distance: 35 %</p>
      <p>GPT-4: L'operatore di Polizia Locale è un punto di riferimento importante per la comunità. Svolge i suoi compiti con autorità, mostrando competenza professionale e sensibilità nelle relazioni. Inoltre, lavora sempre per capire al meglio le situazioni e i problemi che incontra, usando un approccio educativo per rispondere adeguatamente ai bisogni dei cittadini.</p>
      <p>Gulpease Index: 48; Flesch Vacca Index: 32; NVdB: 84 %; Passive: 0 %; Semantic Similarity: 97 %; Edit distance: 48 %</p>
      <p>GPT-3.5-Turbo: L'operatore di Polizia Locale è un punto di riferimento importante per la comunità. Nell'esercizio delle sue funzioni, svolge i compiti con autorevolezza, dimostrando professionalità e sensibilità nei rapporti con le persone. Si impegna costantemente a comprendere al meglio le situazioni e le problematiche affrontate, adottando un approccio educativo e orientato a rispondere in modo adeguato ai bisogni dei cittadini.</p>
      <p>Gulpease Index: 45; Flesch Vacca Index: 27; NVdB: 78 %; Passive: 0 %; Semantic Similarity: 98 %; Edit distance: 45 %</p>
      <p>LLaMA 3: L'operatore di Polizia Locale è un punto di riferimento importante per la comunità. Esegue i suoi compiti con autorità, professionalità e sensibilità nel rapporto con la gente. La sua attività è caratterizzata dal costante impegno per comprendere meglio le situazioni e i problemi, e per rispondere in modo educativo ai bisogni dei cittadini, con un approccio orientato alla loro assistenza.</p>
      <p>Gulpease Index: 50; Flesch Vacca Index: 37; NVdB: 85 %; Passive: 28 %; Semantic Similarity: 96 %; Edit distance: 54 %</p>
      <p>Phi 3: L'operatore di Polizia Locale è un punto di riferimento importante per la comunità. Esegue i suoi compiti con autorità, professionalità e sensibilità nel trattare le persone. La sua attività è guidata dal desiderio di capire meglio le situazioni e le problematiche, e di rispondere in modo appropriato ai bisogni dei cittadini, con un approccio educativo.</p>
      <p>Gulpease Index: 52; Flesch Vacca Index: 38; NVdB: 82 %; Passive: 28 %; Semantic Similarity: 96 %; Edit distance: 56 %</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Vaswani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Shazeer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Parmar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Uszkoreit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. N.</given-names>
            <surname>Gomez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Kaiser</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Polosukhin</surname>
          </string-name>
          ,
          <article-title>Attention is all you need</article-title>
          ,
          <source>in: Advances in Neural Information Processing Systems (NIPS)</source>
          , volume
          <volume>30</volume>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>