=Paper=
{{Paper
|id=Vol-3878/91_main_long
|storemode=property
|title=AI vs. Human: Effectiveness of LLMs in Simplifying Italian Administrative Documents
|pdfUrl=https://ceur-ws.org/Vol-3878/91_main_long.pdf
|volume=Vol-3878
|authors=Marco Russodivito,Vittorio Ganfi,Giuliana Fiorentino,Rocco Oliveto
|dblpUrl=https://dblp.org/rec/conf/clic-it/RussodivitoGFO24
}}
==AI vs. Human: Effectiveness of LLMs in Simplifying Italian Administrative Documents==
Marco Russodivito¹,†, Vittorio Ganfi¹,*,†, Giuliana Fiorentino¹ and Rocco Oliveto¹
¹ University of Molise, Italy
Abstract
This study investigates the effectiveness of Large Language Models (LLMs) in simplifying Italian administrative texts compared to human informants. It evaluates the performance of several well-known LLMs, including GPT-3.5-Turbo, GPT-4, LLaMA 3, and Phi 3, on s-ItaIst, a representative corpus of Italian administrative documents. To accurately compare the simplification abilities of humans and LLMs, six parallel corpora of a subsection of ItaIst were collected. These parallel corpora were analyzed using both complexity and similarity metrics to assess the outcomes of the LLMs and of the human participants. Our findings indicate that while LLMs perform comparably to humans in many aspects, there are notable differences in the structural and semantic changes they introduce. The results underscore the potential and limitations of using AI for administrative text simplification, highlighting areas where LLMs need improvement to achieve human-level proficiency.
Keywords
Automatic Text Simplification, Large Language Models, Italian Administrative language
1. Introduction

Due to the increasing popularity of generative Artificial Intelligence (AI) language tools [1, 2], significant attention has been devoted to the use of LLMs for text simplification [3]. Several studies have addressed the application of LLMs to simplify texts, particularly focusing on administrative documents, including those in Italian [4, 5, 6]. Italian administrative texts are often notably complex and obscure [7, 8, 9], which restricts a large segment of the population from fully accessing the content produced by the Italian public administration [10, 11]. This work aims to (a) evaluate the quality of automatic text simplification performed by several well-known LLMs, and (b) compare LLM-based simplification with human-based simplification. To address these research questions, the following procedures were undertaken:

1. From an empirical perspective, a large corpus of Italian administrative texts was collected (i.e., ItaIst). A parallel simplified counterpart of the corpus was created using different LLMs. Additionally, a shorter version of the administrative corpus was manually simplified by two annotators.

2. From an analytical perspective, several statistical analyses were conducted to measure the semantic and complexity closeness between human- and LLM-generated data. The comparison of scores for both the LLM and human datasets highlights significant differences and similarities between manual and AI-driven simplification.

The results concerning readability indexes (e.g., Gulpease) and semantic and structural similarities (e.g., edit distance) reveal that LLMs generally perform comparably to human informants. However, AI-simplified texts are slightly less similar to the original documents than those generated by human simplifiers: LLMs tend to introduce more changes into the simplified corpora than human annotators. The empirical study indicates that texts simplified by AI exhibit more structural and lexical dissimilarities from the original documents than those simplified by humans.

Replication package. All the code and data are available on Figshare at https://figshare.com/s/4d927fe648c6f1cb4227.

CLiC-it 2024: Tenth Italian Conference on Computational Linguistics, Dec 04-06, 2024, Pisa, Italy
* Corresponding author.
† These authors contributed equally.
marco.russodivito@unimol.it (M. Russodivito); vittorio.ganfi@unimol.it (V. Ganfi); giuliana.fiorentino@unimol.it (G. Fiorentino); rocco.oliveto@unimol.it (R. Oliveto)
ORCID: 0009-0004-8860-1739 (M. Russodivito); 0000-0002-0892-7287 (V. Ganfi); 0000-0002-0392-9056 (G. Fiorentino); 0000-0002-7995-8582 (R. Oliveto)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073.

2. Related Work

Several researchers have conducted research on evaluating the accountability of LLMs in text simplification and on assessing the metrics employed to measure the quality of LLM text simplification [12, 13, 14, 15, 16]. In particular, numerous studies have focused on assessing the use of LLMs to simplify Italian administrative texts, highlighting the potential of these models to enhance text readability. Some studies have specifically evaluated the readability of simplified administrative texts
by comparing parallel corpora of simplified documents and adopting a qualitative, interpretative approach [17]. Other contributions have assessed the outputs of LLMs in simplification tasks, particularly focusing on models partially trained on Italian [18].

Our paper analyzes the differences between LLM and human simplification of Italian administrative texts, following a quantitative approach. By examining these differences, our study aims to highlight the similarities and dissimilarities that emerge when administrative documents are simplified by humans and by AI.

3. Study Design

Our study aims to analyze the effectiveness of modern LLMs in simplifying administrative text. To achieve this, we address the following Research Question (RQ):

How effective are AI systems at simplifying administrative texts compared to humans?

This question evaluates whether modern AI can achieve a level of quality comparable to that of human experts, our reference, by analyzing how well LLMs can reduce complexity while preserving the original meaning of the texts. The study has been conducted on a sub-corpus of ItaIst, utilizing several LLMs to support the text simplification process.

3.1. Corpus

The ItaIst corpus has been created as part of the VerbACxSS research project. It was composed by linguists and jurists to create a representative linguistic resource for contemporary administrative Italian [19, 20]. ItaIst was assembled by collecting recent official documents from the local and regional public administration websites of eight Italian regions (Basilicata, Calabria, Campania, Lazio, Lombardy, Molise, Tuscany, and Veneto), covering topics such as garbage collection, healthcare, and public services. The corpus includes a variety of text types, such as Tender Notices, Planning Acts, and Services Charters.

The reliability of the corpus design was ensured by (a) linguists, who checked that the corpus represents administrative Italian in terms of textual and diatopic features, and (b) jurists, who selected and validated each document included in ItaIst. The resulting corpus, comprising 208 documents, consists of around 2,000,000 tokens and 45,000 types¹. More information about the ItaIst corpus can be found in Appendix A.

To make a fair comparison between humans and AI, a sub-corpus of ItaIst (hereinafter, s-ItaIst) was extracted. The s-ItaIst sub-corpus was composed by selecting representative documents from each region, balancing the topics and text types of the main corpus. Table 1 provides a summary of s-ItaIst.

Table 1
An overview of the main metrics of the s-ItaIst corpus.

Metric        Value
# documents   8
# sentences   1,314
# tokens      33,295
# types       5,622

3.2. LLMs

To investigate both open-source and commercial models, the s-ItaIst corpus was simplified using four distinct LLMs: the commercial GPT-3.5-Turbo [21] and GPT-4 [22] by OpenAI, and the open-source LLaMA 3 [23] by Meta and Phi 3 [23] by Microsoft. For the open-source models, we used the LLaMA 3 8B² and Phi 3 3.8B³ variants, both fine-tuned on large Italian corpora. This selection explores models of various sizes while ensuring optimal performance on Italian tasks.

A detailed prompt was formulated to instruct each model to perform the simplification task properly, avoiding summarization and applying state-of-the-art simplification rules [9]. The full prompt can be found in Appendix B.

The OpenAI models were accessed via their APIs⁴, while the open-source models were hosted on an AWS EC2 G6⁵ instance equipped with a single Nvidia L4 GPU with 24 GB of vRAM.

3.3. Experimental Procedure

To address our research question, we conducted an empirical study to compare automatic and manual simplifications. Our study, illustrated in Figure 1, can be summarized in three main steps: (i) constructing a corpus of administrative documents (i.e., s-ItaIst), (ii) simplifying this corpus using four LLMs and two human annotators, and (iii) comparing the LLM-simplified corpora with the human-simplified corpora.

It is worth noting that the s-ItaIst corpus was subdivided into small sections (2-6 sentences) to avoid exceeding the context windows of the LLMs and to facilitate the human informants during simplification⁶.

¹ https://huggingface.co/datasets/VerbACxSS/ItaIst
² https://huggingface.co/DeepMount00/Llama-3-8b-Ita (last seen 07-21-2024)
³ https://huggingface.co/e-palmisano/Phi3-ITA-mini-4K-instruct (last seen 07-21-2024)
⁴ https://openai.com/api/ (last seen 07-21-2024)
⁵ https://aws.amazon.com/it/ec2/instance-types/g6/ (last seen 07-21-2024)
⁶ The s-ItaIst corpus was segmented into a total of 619 sections of text. Each section was then assigned to the human annotators and LLMs for simplification.
Figure 1: Experimental design schema: The s-ItaIst corpus was simplified both automatically and manually by two humans (Human1, Human2) and four LLMs (GPT-4, GPT-3.5-Turbo, LLaMA 3, Phi 3). The resulting six parallel corpora were analyzed using complexity metrics (Gulpease Index, Flesch-Vacca Index, NVdB %, Passive verbs %) and similarity metrics (Semantic Similarity %, Edit Distance %).
Human annotators with strong backgrounds in linguistics and deep knowledge of administrative text simplification simplified the corpus following common simplification rules identified in the literature [24, 25, 8, 9]. They used a custom web application that (i) assigned the sections of the document to be simplified and (ii) tracked the time they spent on the activity. Similarly, each LLM was instructed to automatically simplify every document in the corpus one section at a time.

This approach provided a comprehensive comparison dataset of six distinct parallel corpora. We analyzed these data to compare human and automatic simplifications by extracting features such as complexity and similarity metrics, measuring the quality of the simplified texts and their relatedness to the original text. Furthermore, we computed the Wilcoxon Signed-Rank Test [26] to statistically evaluate the difference between LLM and human metrics, and Cliff's Delta [27, 28] to provide a measure of the effect size.

3.4. Metrics

To assess the quality of the simplifications, we employed both complexity and similarity metrics from the literature. Complexity metrics compare the ease of the original and simplified text, while similarity metrics measure the distance between them. We implemented these metrics according to the state of the art, leveraging natural language processing (NLP) techniques (e.g., tokenization, POS tagging⁷).

Several simplicity measures are employed in the literature (for instance, SAMSA [29] and SARI [30]), although their results may vary depending on the level of analysis examined and, of course, on the design of the metrics. For instance, SAMSA aims to measure structural simplicity by monitoring sentence-splitting accuracy, while SARI was developed to measure the simplicity advantage when only lexical paraphrasing is evaluated. Furthermore, some studies show that, when calculated using multi-operation manual references, both a generic metric like BLEU [31] and an operation-specific one like SARI have low associations with assessments of overall simplicity [32]. Thus, to measure the readability of the investigated corpora we selected:

1. the Flesch-Vacca Index, the Gulpease Index, and READ-IT, since they are advanced instruments designed to investigate the degree of simplicity of Italian texts, and
2. the percentages of some lexical and structural features (i.e., the amount of most common lexical items and of active verb forms) that increase the readability of texts.

Similarly, for similarity metrics the computational literature offers several resources aiming to measure the structural or semantic proximity of texts. Some of these operate at the level of n-gram overlap (e.g., BLEU [31] and METEOR [33]), while others consider other features. For this analysis, we selected Semantic Similarity to quantify the degree of semantic closeness between corpora and Edit distance to measure structural similarities between the investigated corpora.

To support future research, we have made our metrics implementation publicly available⁸.

Details concerning the considered complexity metrics are shown below:

• Gulpease Index [34]: This metric evaluates the readability of an Italian text and assesses the education level required to fully comprehend it. It is calculated using the following formula:

    89 + (300 · sentences − 10 · characters) / tokens    (1)

• Flesch-Vacca Index [35]: This is an adaptation of the original Flesch Reading Ease formula for evaluating the readability of Italian texts, computed as follows:

    217 − 130 · (syllables / tokens) − (tokens / sentences)    (2)

• READ-IT [36]: This tool is the first advanced readability evaluation instrument for Italian, combining traditional raw text features with lexical, morpho-syntactic, and syntactic information. Four different readability models are included in the tool: READ-IT BASE includes only raw features, calculating sentence length (average number of words per sentence) and word length (average number of characters per word); READ-IT LEXICAL combines raw (e.g., word length) and lexical (e.g., Type/Token Ratio) features; READ-IT SYNTACTIC employs raw text (e.g., sentence length) and morpho-syntactic (e.g., average number of clauses per sentence) properties; READ-IT GLOBAL includes all the other features, combining raw text, lexical, morpho-syntactic and syntactic (e.g., the depth of the whole parse tree) features⁹.

• NVdB (%): "Il Nuovo vocabolario di base della lingua italiana" [37] consists of fundamental and commonly used words representing the essential lexicon of the Italian language. The ease of a text can be roughly estimated by the number of its words listed in the basic vocabulary [38].

• Passive (%): Overuse of the passive voice can lead to ambiguity and complexity, especially for readers who may struggle with comprehension [24, 25, 9]. It is calculated by identifying verbs with the aux:pass relation in the dependency parse tree.

Details concerning the considered similarity metrics are shown below:

• Semantic Similarity (%) [39]: This metric measures the distance between the semantic meanings of two documents. It can be computed by exploiting relevant methodologies from the literature, such as BERTScore [40] and SBERT [41]. We opted for the latter approach, which leverages cosine similarity between contextual embeddings (obtained through sentence-transformers and an open-source multilingual model¹⁰) to evaluate similarity at the sentence level, encapsulating the overall contextual meaning [42].

• Edit distance (%) [43]: This metric measures the similarity between two strings based on the number of single-character edits (insertions, deletions, or substitutions) required to transform one text into the other. A value close to zero indicates a relatively minor difference between the two texts, while a high value indicates significant rephrasing.

3.5. Threats to Validity

We analyze the validity of our study by examining construct, internal, and external validity. This evaluation helps us understand the strengths and limitations of our methodology and the generalizability of our findings.

Construct validity: The two linguistic experts involved in the manual simplification of the s-ItaIst corpus may have produced divergent variants due to their subjective approaches. Despite differences in seniority, both experts have strong linguistic backgrounds (holding PhDs) and several years of experience. Nevertheless, involving two human simplifiers allowed us to explore distinct simplification approaches and compare automatic simplification against two varied benchmarks.

Internal validity: The LLMs used for automatic text simplification, particularly those from HuggingFace, may have been trained on non-administrative texts, potentially introducing issues into the simplified text. However, we relied on state-of-the-art models tested against several benchmarks [44, 45, 46, 47]. Additionally, the embeddings for calculating Semantic Similarity were obtained through a multilingual model chosen for its high ranking on the MTEB leaderboard¹¹, particularly for its performance on the STS22 (it) benchmark [48].

External validity: Our study focuses on the sub-corpus of ItaIst, consisting of eight administrative documents. Although the number of documents is relatively small, the corpus includes over 1,000 sentences. Manual simplification of the corpus took Human1 and Human2 15 and 23 hours, respectively. Extending our study to the entire ItaIst corpus would have been infeasible. However, the documents of the ItaIst sub-corpus were not chosen randomly; they were selected to represent the variety of administrative texts.

⁷ The process of tokenization and tagging was conducted using the spaCy natural language processing tool: https://spacy.io (last seen 07-21-2024)
⁸ https://pypi.org/project/italian-ats-evaluator (last seen 07-21-2024)
⁹ http://www.italianlp.it/demo/read-it (last seen 04-10-2024)
¹⁰ https://huggingface.co/intfloat/multilingual-e5-base (last seen 07-21-2024)
¹¹ https://huggingface.co/spaces/mteb/leaderboard (last seen 07-21-2024)
Table 2
Metrics evaluated across the original corpus and the human and LLM simplified corpora.
Original Human1 Human2 GPT-3.5-Turbo GPT-4 LLaMA 3 Phi 3
Tokens 33,295 34,135 29,755 30,032 31,722 36,035 36,056
Sentences 1,314 1,506 1,744 1,515 1,840 1,944 1,900
Tokens per Sentence 25.33 22.66 17.06 19.53 17.24 18.53 18.97
Sentences per Document 164.25 188.25 218.00 189.37 230.00 243.00 237.50
Gulpease Index 44.31 49.72 50.64 48.49 51.34 50.26 50.16
Flesch Vacca Index 19.97 34.23 33.63 30.33 36.75 34.09 33.75
NVdB (%) 73.28 80.44 76.89 78.28 81.07 80.18 80.16
Passive (%) 20.87 15.78 17.71 13.99 12.00 15.81 15.72
READ-IT BASE (%) 75.91 68.62 51.00 66.61 55.00 58.37 57.69
READ-IT LEXICAL (%) 93.64 85.37 89.71 91.96 90.29 77.13 75.74
READ-IT SYNTACTIC (%) 63.72 53.14 40.09 38.42 29.92 40.97 41.24
READ-IT GLOBAL (%) 86.48 69.24 61.34 68.69 54.60 59.26 58.37
Semantic Similarity (%) - 96.52 97.26 96.06 95.80 94.96 94.96
Edit distance (%) - 35.84 29.20 49.21 52.14 55.48 55.44
4. Results and Discussion

A preliminary analysis of our results, summarized in Table 2, reveals several significant similarities and differences between the human and LLM datasets. For instance, the variation in the number of tokens is similar across both human and LLM corpora, although LLMs generally increase the number of sentences more prominently than the human annotators.

Regarding complexity metrics, all the parallel corpora (both human and LLM) exhibit a general increase in readability compared to the original texts. For example, the majority of the corpora improve the Gulpease Index, shifting the difficulty level from very difficult to difficult for middle-school reading levels [34] (except for Human1 and GPT-3.5-Turbo). Additionally, complexity metrics vary similarly across both human and LLM groups, with differences between manual and AI simplifiers not significantly greater than those between Human1 and Human2 or among GPT-3.5-Turbo, GPT-4, LLaMA 3, and Phi 3.

The analysis of semantic and structural distance metrics from the original s-ItaIst shows more pronounced differences between the human and LLM datasets. In terms of semantic similarity (Semantic Similarity), the Human1 and Human2 corpora are closer to the original meaning than the LLM-simplified corpora. These differences are even more pronounced when considering edit distance (Edit distance): the percentage of edit distance is higher in the LLM group, with each LLM corpus exceeding the human ones by at least 10%.

Higher degrees of Semantic Similarity and lower degrees of Edit distance in the human corpora indicate that human annotators tend to make fewer changes to the original text than LLMs.

As reported in Table 2, GPT-4 achieved the best results across the majority of metrics (except for READ-IT LEXICAL). To validate our outcomes, we performed the Wilcoxon Signed-Rank Test and calculated Cliff's Delta effect size to analyze the difference between GPT-4 and human metrics. By examining the results in Table 3, we can assert that:

GPT-4 simplifications can be comparable to human simplifications. GPT-4 simplifications are negligibly better for complexity metrics, moderately worse for similarity, and largely rephrased compared to human simplifications.

The results of the Wilcoxon Signed-Rank Test and Cliff's Delta effect size for the other models, though not fully significant, are listed in Appendix C.

A brief extract taken from the Original, Human1, Human2, and GPT-4 parallel corpora, showing the same phrase simplified by the two human annotators and GPT-4, is given below¹²:

Original: fatturato minimo annuo, per gli ultimi tre esercizi, pari o superiore al valore stimato del presente appalto ("minimum annual turnover, for the last three financial years, equal to or greater than the estimated value of this contract")

Human1: Guadagno in un anno (fatturato minimo annuo) negli ultimi 3 anni di valore uguale o superiore al valore di questo bando

Human2: l'ammontare di fatture emesse annualmente, per gli ultimi tre anni, deve essere pari o superiore al valore stimato del presente appalto

GPT-4: un fatturato annuo minimo, negli ultimi tre anni, uguale o maggiore al valore stimato dell'appalto

¹² A more extensive example of the human and LLM simplifications collected in the parallel corpora designed for this study can be found in Appendix D.
Table 3
Results of the Wilcoxon Signed-Rank Test and Cliff's Delta effect size performed on the GPT-4, Human1, and Human2 metrics.

Metric                  p-value     Effect Size
Human1
  Gulpease Index        < 0.0001    negligible ↗
  Flesch-Vacca Index    < 0.0001    negligible ↗
  NVdB                  0.0108      negligible ↗
  Passive               0.0004      negligible ↘
  READ-IT BASE          < 0.0001    small ↘
  READ-IT LEXICAL       < 0.0001    negligible ↗
  READ-IT SYNTACTIC     < 0.0001    small ↘
  READ-IT GLOBAL        < 0.0001    small ↘
  Semantic Similarity   < 0.0001    small ↘
  Edit distance         < 0.0001    large ↗
Human2
  Gulpease Index        0.0092      negligible ↗
  Flesch-Vacca Index    < 0.0001    negligible ↗
  NVdB                  < 0.0001    small ↗
  Passive               < 0.0001    negligible ↘
  READ-IT BASE          0.0292      negligible ↗
  READ-IT LEXICAL
  READ-IT SYNTACTIC     < 0.0001    negligible ↘
  READ-IT GLOBAL        < 0.0001    negligible ↘
  Semantic Similarity   < 0.0001    medium ↘
  Edit distance         < 0.0001    large ↗

In the above syntagmas, the similarities between the simplifications are quite obvious: for example, the technical term esercizio and the more ambiguous word pari are replaced by the more common lexical equivalents anno and uguale, respectively.

5. Conclusion

In this study, we investigated the automatic simplification of Italian administrative documents. Our results demonstrate that LLMs can effectively simplify these texts, performing comparably to humans¹³.

Among the models examined, GPT-4 shows superior performance in text simplification, exhibiting significant improvements in complexity metrics. Nonetheless, it is noteworthy that humans tend to maintain higher Semantic Similarity and lower Edit distance, ensuring the preservation of the original meaning and structure of the text. In other words, humans, aware of the importance of precise language in these documents, mostly preserved the original meaning and structure, whereas LLMs, while simplifying, tended to rephrase extensively. This rephrasing, although effective in reducing complexity, might inadvertently alter the legal nuances, which are critical in administrative texts.

Despite this limitation, LLMs can serve as valuable support tools for text simplification, significantly accelerating a process that typically requires hours of manual work. By generating initial drafts, LLMs can reduce the workload of human experts, who would then review and refine the AI-generated drafts, ensuring the preservation of the overall meaning and legal integrity of the text. The results of our study indicate that modern LLMs can simplify administrative documents almost as effectively as humans. However, our findings also indicate that LLMs do not fully preserve the semantic meaning of the text, tending to rephrase more extensively than humans, which could introduce legal issues into the simplified text. Further studies could be conducted to evaluate the juridical equivalence of automatically simplified documents; a manual investigation of our parallel corpus, supervised by expert jurists, may reveal important implications in this sensitive context.

Another promising direction for future research is to investigate the impact of automatic simplification on text comprehension. An additional empirical study could be designed to evaluate whether automatically simplified documents are easier to understand than their original versions.

Additionally, it would be worthwhile to explore different prompting strategies to further improve simplification quality. For instance, few-shot prompting [50] with some manually simplified gold samples could better align LLMs with human style.

Acknowledgments

This contribution is a result of the research conducted within the framework of the PRIN 2020 (Progetti di Rilevante Interesse Nazionale) project "VerbACxSS: on analytic verbs, complexity, synthetic verbs, and simplification. For accessibility" (Prot. 2020BJKB9M), funded by the Italian Ministry of Universities and Research.

Giuliana Fiorentino and Rocco Oliveto are responsible for research question identification, study design, research supervision, and data analysis. However, for academic reasons, Section 2, Section 3.1, Section 3.3, Section 4, and Section 5 are attributed to Vittorio Ganfi; and Section 1, Section 3, Section 3.2, Section 3.4, and Section 3.5 to Marco Russodivito.

¹³ Further evidence showing that LLM simplifications preserve the meaning of the original texts was obtained in a study conducted on the same data. The unpublished research indicated that experienced evaluators, i.e., jurists with administrative competence, agree that LLM simplifications of administrative texts maintain the legal integrity of the original documents [49].

References

[1] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, I. Polosukhin, Attention is all you need, in: Advances in Neural Information Processing Systems (NIPS), volume 30, 2017.
[2] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, J. Davison, S. Shleifer, P. von Platen, C. Ma, Y. Jernite, J. Plu, C. Xu, T. Le Scao, S. Gugger, M. Drame, Q. Lhoest, A. Rush, Transformers: State-of-the-art natural language processing, in: Conference on Empirical Methods in Natural Language Processing: System Demonstrations (EMNLP), 2020, pp. 38–45.
[3] M. J. Ryan, T. Naous, W. Xu, Revisiting non-English text simplification: A unified multilingual benchmark, Association for Computational Linguistics (ACL) (2023).
[4] D. Brunato, F. Dell'Orletta, G. Venturi, S. Montemagni, Design and Annotation of the First Italian Corpus for Text Simplification, in: Linguistic Annotation Workshop (LAW), 2015, pp. 31–41.
[5] M. Miliani, S. Auriemma, F. Alva-Manchego, A. Lenci, Neural readability pairwise ranking for sentences in Italian administrative language, in: Asia-Pacific Chapter of the Association for Computational Linguistics (AACL) and International Joint Conference on Natural Language Processing (IJCNLP), 2022, pp. 849–866.
[6] M. Miliani, M. S. Senaldi, G. Lebani, A. Lenci, Understanding Italian Administrative Texts: A Reader-Oriented Study for Readability Assessment and Text Simplification, in: Workshop on AI for Public Administration (AIxPA), 2022, pp. 71–87.
[7] S. Lubello, La lingua del diritto e dell'amministrazione, Il mulino, Bologna, 2017.
[8] M. Cortelazzo, Il linguaggio amministrativo. Principi e pratiche di modernizzazione, Carocci, Roma, 2021.
[9] G. Fiorentino, V. Ganfi, Parametri per semplificare l'italiano istituzionale: Revisione della letteratura, Italiano LinguaDue 16 (2024) 220–237.
[10] E. Piemontese (Ed.), Il dovere costituzionale di farsi capire. A trent'anni dal Codice di stile, Carocci, Roma, 2023.
[11] S. Lubello, Da dembsher al codice di stile e oltre: un bilancio sul linguaggio burocratico, in: E. Piemontese (Ed.), Il dovere costituzionale di farsi capire. A trent'anni dal Codice di stile, Carocci, Roma, 2023, pp. 54–70.
[12] G. Gonzalez Delgado, B. Navarro Colorado, The Simplification of the Language of Public Administration: The Case of Ombudsman Institutions, in: Proceedings of the Workshop on DeTermIt! Evaluating Text Difficulty in a Multilingual Context, 2024, pp. 125–133.
[13] R. Doshi, K. Amin, P. Khosla, S. Bajaj, S. Chheang, H. P. Forman, Utilizing large Language Models to Simplify Radiology Reports: a comparative analysis of ChatGPT3.5, ChatGPT4.0, Google Bard, and Microsoft Bing, medRxiv (2023). doi:10.1101/2023.06.04.23290786.
[14] P. Mavrepis, G. Makridis, G. Fatouros, V. Koukos, M. M. Separdani, D. Kyriazis, XAI for all: Can large language models simplify explainable AI?, arXiv preprint arXiv:2401.13110 (2024).
[15] Y. Ma, S. Seneviratne, E. Daskalaki, Improving Text Simplification with Factuality Error Detection, in: Workshop on Text Simplification, Accessibility, and Readability (TSAR), 2022, pp. 173–178.
[16] F. Alva-Manchego, C. Scarton, L. Specia, Data-Driven Sentence Simplification: Survey and Benchmark, Computational Linguistics 46 (2020) 135–187.
[17] M. Miliani, F. Alva-Manchego, A. Lenci, Simplifying Administrative Texts for Italian L2 Readers with Controllable Transformer Models: A Data-driven Approach, in: CLiC-it, 2023.
[18] D. Nozza, G. Attanasio, et al., Is it really that simple? Prompting language models for automatic text simplification in Italian, in: CEUR Workshop Proceedings, 2023.
[19] D. Vellutino, et al., L'italiano istituzionale per la comunicazione pubblica, Il mulino, Bologna, 2018.
[20] D. Vellutino, N. Cirillo, Corpus «ItaIst»: Note per lo sviluppo di una risorsa linguistica per lo studio dell'italiano istituzionale per il diritto di accesso civico, Italiano LinguaDue 16 (2024) 238–250.
[21] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al., Language models are few-shot learners, Advances in Neural Information Processing Systems (NIPS) 33 (2020) 1877–1901.
[22] J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat, et al., GPT-4 technical report, arXiv preprint arXiv:2303.08774 (2023).
[23] AI@Meta, Llama 3 model card (2024). URL: https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md.
[24] E. Piemontese, Criteri e proposte di semplificazione, in: Codice di stile delle comunicazioni scritte a uso delle pubbliche amministrazioni, Istituto Poligrafico e Zecca dello Stato, Roma, 1994.
[25] A. Fioritto, Manuale di stile. Strumenti per semplificare il linguaggio delle amministrazioni pubbliche, Il mulino, Bologna, 1997.
[26] F. Wilcoxon, Probability tables for individual comparisons by ranking methods, Biometrics 3 (1947) 119–122.
[27] N. Cliff, Dominance statistics: Ordinal analyses to answer ordinal questions, Psychological Bulletin 114 (1993) 494–509.
[28] N. Cliff, Ordinal methods for behavioral data analysis, Psychology Press, New York, 2014.
[29] E. Sulem, O. Abend, A. Rappoport, Semantic structural evaluation for text simplification, in: M. Walker, H. Ji, A. Stent (Eds.), Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), Association for Computational Linguistics, New Orleans, Louisiana, 2018, pp. 685–696. URL: https://aclanthology.org/N18-1063. doi:10.18653/v1/N18-1063.
[30] W. Xu, C. Napoles, E. Pavlick, Q. Chen, C. Callison-Burch, Optimizing Statistical Machine Translation for Text Simplification, Transactions of the Association for Computational Linguistics 4 (2016) 401–415. URL: https://doi.org/10.1162/tacl_a_00107. doi:10.1162/tacl_a_00107.
[31] K. Papineni, S. Roukos, T. Ward, W.-J. Zhu, Bleu: a
[38] doi:10.3389/fpsyg.2022.707630.
[39] D. Chandrasekaran, V. Mago, Evolution of semantic similarity—A survey, ACM Computing Surveys (CSUR) 54 (2021) 1–37.
[40] T. Zhang, V. Kishore, F. Wu, K. Q. Weinberger, Y. Artzi, BERTScore: Evaluating text generation with BERT, in: International Conference on Learning Representations, 2020. URL: https://openreview.net/forum?id=SkeHuCVFDr.
[41] N. Reimers, I. Gurevych, Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks, in: Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics, 2019.
[42] A. Barayan, J. Camacho-Collados, F. Alva-Manchego, Analysing zero-shot readability-
method for automatic evaluation of machine trans- controlled sentence simplification, arXiv preprint
lation, in: Proceedings of the 40th Annual Meet- arXiv:2409.20246 (2024).
ing on Association for Computational Linguistics, [43] F. P. Miller, A. F. Vandome, J. McBrewster, Lev-
ACL ’02, Association for Computational Linguis- enshtein distance: Information theory, computer
tics, USA, 2002, p. 311–318. URL: https://doi.org/ science, string (computer science), string metric,
10.3115/1073083.1073135. doi:10.3115/1073083. damerau? Levenshtein distance, spell checker, ham-
1073135. ming distance, Alpha Press, Olando, 2009.
[32] F. Alva-Manchego, C. Scarton, L. Specia, The [44] D. Hendrycks, C. Burns, S. Basart, A. Zou,
(Un)Suitability of Automatic Evaluation Metrics for M. Mazeika, D. Song, J. Steinhardt, Measuring
Text Simplification, Computational Linguistics 47 massive multitask language understanding, Inter-
(2021) 861–889. URL: https://doi.org/10.1162/coli_ national Conference on Learning Representations
a_00418. doi:10.1162/coli_a_00418. (ICLR) (2021).
[33] S. Banerjee, A. Lavie, Meteor: An automatic metric [45] R. Zellers, A. Holtzman, Y. Bisk, A. Farhadi, Y. Choi,
for mt evaluation with improved correlation with Hellaswag: Can a machine really finish your sen-
human judgments, in: Workshop on Intrinsic and tence?, in: Proceedings of the 57th Annual Meeting
Extrinsic Evaluation Measures for Machine Trans- of the Association for Computational Linguistics,
lation and/or Summarization, 2005, pp. 65–72. 2019, p. 4791–4800.
[34] P. Lucisano, M. E. Piemontese, Gulpease: una for- [46] P. Clark, I. Cowhey, O. Etzioni, T. Khot, A. Sab-
mula per la predizione della leggibilita di testi in harwal, C. Schoenick, O. Tafjord, Think you have
lingua italiana, Scuola e città (1988) 110–124. solved question answering? try arc, the ai2 rea-
[35] V. Franchina, R. Vacca, Adaptation of flesh readabil- soning challenge, arXiv preprint arXiv:1803.05457
ity index on a bilingual text written by the same (2018).
author both in italian and english languages, Lin- [47] D. Dua, Y. Wang, P. Dasigi, G. Stanovsky, S. Singh,
guaggi 3 (1986) 47–49. M. Gardner, Drop: A reading comprehension bench-
[36] F. Dell’Orletta, S. Montemagni, G. Venturi, Read–it: mark requiring discrete reasoning over paragraphs,
Assessing readability of italian texts with a view to in: J. Burstein, C. Doran, T. Solorio (Eds.), Proceed-
text simplification, in: Proceedings of the second ings of the 2019 Conference of the North American
workshop on speech and language processing for Chapter of the Association for Computational Lin-
assistive technologies, 2011, pp. 73–83. guistics: Human Language Technologies, Volume 1
[37] T. De Mauro, I. Chiari, Il nuovo vo- (Long and Short Papers), 2019, pp. 2368–2378.
cabolario di base della lingua italiana [48] N. Muennighoff, N. Tazi, L. Magne, N. Reimers,
(2016). URL: https://www.internazionale. MTEB: Massive text embedding benchmark, in:
it/opinione/tullio-de-mauro/2016/12/23/ European Chapter of the Association for Computa-
il-nuovo-vocabolario-di-base-della-lingua-italiana. tional Linguistics (EACL), 2023, pp. 2014–2037.
[38] D. Brunato, F. Dell’Orletta, G. Venturi, [49] G. Fiorentino, M. Russodivito, V. Ganfi, R. Oliveto,
Linguistically-Based Comparison of Differ- Validazione e confronto tra semplificazione auto-
ent Approaches to Building Corpora for matica e semplificazione manuale di testi in italiano
Text Simplification: A Case Study on Ital- istituzionale ai fini dell’efficacia comunicativa, in:
ian, Frontiers in Psychology 13 (2022). Automated texts In the ROMance languages and be-
yond” (AI-ROM-II), 2nd International Conference, advances of few-shot learning methods and appli-
To appear. cations, Science China Technological Sciences 66
[50] J. Wang, K. Liu, Y. Zhang, B. Leng, J. Lu, Recent (2023) 920–944.
Table 5
Results of the Wilcoxon Signed-Rank Test and Cliff’s Delta Effect Size performed on GPT-3.5-Turbo, Human1, and Human2 metrics.

        Metrics               p-value     Effect Size
Human1  Gulpease Index        < 0.0001    negligible ↘
        Flesch Vacca Index    < 0.0001    negligible ↘
        NVdB                  < 0.0001    negligible ↘
        Passive
        READ-IT BASE          0.0052      negligible ↘
        READ-IT LEXICAL       < 0.0001    negligible ↗
        READ-IT SYNTACTIC     < 0.0001    small ↘
        READ-IT GLOBAL
        Semantic Similarity   < 0.0001    small ↘
        Edit distance         < 0.0001    medium ↗
Human2  Gulpease Index        < 0.0001    small ↘
        Flesch Vacca Index    < 0.0001    negligible ↘
        NVdB                  < 0.0001    negligible ↗
        Passive               0.0072      negligible ↘
        READ-IT BASE          < 0.0001    small ↗
        READ-IT LEXICAL       0.0091      negligible ↗
        READ-IT SYNTACTIC
        READ-IT GLOBAL        0.0003      negligible ↗
        Semantic Similarity   < 0.0001    medium ↘
        Edit distance         < 0.0001    large ↗

Table 7
Results of the Wilcoxon Signed-Rank Test and Cliff’s Delta Effect Size performed on Phi 3, Human1, and Human2 metrics.

        Metrics               p-value     Effect Size
Human1  Gulpease Index        0.0134      negligible ↗
        Flesch Vacca Index
        NVdB
        Passive
        READ-IT BASE          < 0.0001    small ↘
        READ-IT LEXICAL       < 0.0001    negligible ↘
        READ-IT SYNTACTIC     < 0.0001    small ↘
        READ-IT GLOBAL        < 0.0001    small ↘
        Semantic Similarity   < 0.0001    medium ↘
        Edit distance         < 0.0001    large ↗
Human2  Gulpease Index
        Flesch Vacca Index
        NVdB                  < 0.0001    small ↗
        Passive
        READ-IT BASE          < 0.0001    negligible ↗
        READ-IT LEXICAL       < 0.0001    small ↘
        READ-IT SYNTACTIC
        READ-IT GLOBAL
        Semantic Similarity   < 0.0001    large ↘
        Edit distance         < 0.0001    large ↗
Table 6
Results of the Wilcoxon Signed-Rank Test and Cliff’s Delta Effect Size performed on LLaMA 3, Human1, and Human2 metrics.

        Metrics               p-value     Effect Size
Human1  Gulpease Index        0.0077      negligible ↗
        Flesch Vacca Index
        NVdB
        Passive
        READ-IT BASE          < 0.0001    small ↘
        READ-IT LEXICAL       < 0.0001    negligible ↘
        READ-IT SYNTACTIC     < 0.0001    small ↘
        READ-IT GLOBAL        < 0.0001    small ↘
        Semantic Similarity   < 0.0001    medium ↘
        Edit distance         < 0.0001    large ↗
Human2  Gulpease Index
        Flesch Vacca Index
        NVdB                  < 0.0001    small ↗
        Passive
        READ-IT BASE          < 0.0001    negligible ↗
        READ-IT LEXICAL       < 0.0001    small ↘
        READ-IT SYNTACTIC
        READ-IT GLOBAL
        Semantic Similarity   < 0.0001    large ↘
        Edit distance         < 0.0001    large ↗

A. Corpus ItaIst

The ItaIst corpus is a comprehensive collection of Italian administrative documents. Table 4 provides an overview of the topics and regions from which these documents were collected. The corpus has been assembled to represent the diversity and complexity of contemporary administrative Italian, ensuring its relevance for linguistic and computational analysis.

Table 4
Topics and regions of documents collected in ItaIst

             Garbage   Healthcare   Public services
Basilicata      8          3               9
Calabria       11          5               9
Campania       14          7               9
Lazio           9          3               9
Lombardia      15          3              11
Molise         10          7               9
Toscana        19          4              12
Veneto          9          5              10
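Cliff’s Delta, the effect size reported in Tables 5, 6, and 7, can be computed directly from two samples of metric scores. The sketch below is a minimal pure-Python version; the magnitude cut-offs (0.147, 0.33, 0.474) follow a widely used convention and are an assumption here, since the paper does not state its exact thresholds:

```python
# Cliff's Delta: (#pairs with x > y minus #pairs with x < y) / (n_x * n_y).
# It ranges from -1 (every x below every y) to +1 (every x above every y).
def cliffs_delta(xs, ys):
    greater = sum(1 for x in xs for y in ys if x > y)
    lesser = sum(1 for x in xs for y in ys if x < y)
    return (greater - lesser) / (len(xs) * len(ys))

def magnitude(delta):
    # Conventional cut-offs; the paper's exact thresholds are not reported.
    d = abs(delta)
    if d < 0.147:
        return "negligible"
    if d < 0.33:
        return "small"
    if d < 0.474:
        return "medium"
    return "large"

# Identical samples yield a delta of 0.0 (negligible).
print(cliffs_delta([1, 2, 3], [1, 2, 3]))  # -> 0.0
print(magnitude(0.5))                      # -> large
```

The arrows in the tables plausibly encode the direction (sign) of the difference, with the magnitude label derived from |delta| as above.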
B. Prompt engineering

In the context of LLMs, the term prompt refers to the instructions provided to a language model to generate a specific response. Prompt engineering is the process of designing a clear and detailed prompt that instructs the model to produce the desired response. The prompt we used to ask the models to simplify administrative text is:

Sei un dipendente pubblico che deve scrivere dei documenti istituzionali italiani per renderli semplici e comprensibili per i cittadini. Ti verrà fornito un documento pubblico e il tuo compito sarà quello di riscriverlo applicando regole di semplificazione senza però modificare il significato del documento originale. Ad esempio potresti rendere le frasi più brevi, eliminare le perifrasi, esplicitare sempre il soggetto, utilizzare parole più semplici, trasformare i verbi passivi in verbi di forma attiva, spostare le frasi parentetiche alla fine del periodo.

(In English: You are a public employee who has to write Italian institutional documents so that they are simple and understandable for citizens. You will be given a public document, and your task will be to rewrite it by applying simplification rules without altering the meaning of the original document. For example, you could make sentences shorter, remove periphrases, always make the subject explicit, use simpler words, turn passive verbs into active ones, and move parenthetical clauses to the end of the sentence.)
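In practice, a prompt like the one above is typically sent as the system message of a chat-style API call, with the document to simplify as the user message. The helper below is an illustrative sketch, not the authors’ actual pipeline; `build_messages` is a hypothetical function and `SYSTEM_PROMPT` abbreviates the Italian instruction above:

```python
# Sketch: pairing the fixed simplification instruction (system role) with one
# administrative document (user role), the message format accepted by most
# chat-completion APIs. SYSTEM_PROMPT is abbreviated; build_messages is ours.
SYSTEM_PROMPT = (
    "Sei un dipendente pubblico che deve scrivere dei documenti "
    "istituzionali italiani per renderli semplici e comprensibili "
    "per i cittadini. [...]"
)

def build_messages(document: str) -> list[dict]:
    """Build the chat messages for one simplification request."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": document},
    ]

messages = build_messages("Il sottoscritto richiede il rilascio del documento.")
print([m["role"] for m in messages])  # -> ['system', 'user']
```

Keeping the instruction fixed across models, as this structure does, is what makes the per-model comparisons in Tables 5–7 meaningful.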
C. Tests

Table 5, Table 6, and Table 7 report the results of the statistical analyses conducted to compare the simplification performance of the various LLMs against human experts. The Wilcoxon Signed-Rank Test and Cliff’s Delta effect size were employed to evaluate the metrics of the GPT-3.5-Turbo, LLaMA 3, and Phi 3 models against those of two human simplifiers, labelled Human1 and Human2. These analyses provide insight into the relative effectiveness of AI-driven simplifications versus human efforts.

D. Examples

Table 8 provides several examples of text simplification. For each example, we present the original text alongside its simplified versions. The values of the complexity and similarity metrics are reported for each text.
Table 8
Examples of simplifications.

Original
L’operatore di Polizia Locale, quindi, rappresenta un importante punto di riferimento per la collettività. Nell’ambito delle sue funzioni, esso svolge i propri compiti in maniera autorevole, dando prova di preparazione professionale e sensibilità nel contatto relazionale. La sua attività, inoltre, è caratterizzata dal costante sforzo teso alla migliore interpretazione delle situazioni e delle problematiche incontrate, applicando un approccio educativo e orientato alla più adeguata risposta ai bisogni della cittadinanza.
Gulpease Index: 38 | Flesch Vacca Index: 12 | NVdB: 77% | Passive: 28% | Semantic Similarity: - | Edit distance: -

Human1
La Polizia Locale è un punto di riferimento per i cittadini. La Polizia Locale ha autorevolezza, professionalità e sensibilità nel contatto con i cittadini. La Polizia Locale cerca sempre di interpretare al meglio situazioni e problematiche incontrate. La Polizia Locale si comporta in modo da educare e rispondere adeguatamente ai bisogni dei cittadini.
Gulpease Index: 55 | Flesch Vacca Index: 33 | NVdB: 67% | Passive: 0% | Semantic Similarity: 93% | Edit distance: 56%

Human2
L’operatore di Polizia Locale, quindi, è un importante punto di riferimento per la collettività. Quando è in servizio, esso svolge i propri compiti in maniera autorevole. Dimostra preparazione professionale e sensibilità nel contatto con gli utenti. Cerca sempre di interpretare le situazioni e i problemi nel migliore dei modi. Applica un approccio educativo e cerca di adeguarsi ai bisogni della cittadinanza.
Gulpease Index: 58 | Flesch Vacca Index: 42 | NVdB: 83% | Passive: 0% | Semantic Similarity: 98% | Edit distance: 35%

GPT-4
L’operatore di Polizia Locale è un punto di riferimento importante per la comunità. Svolge i suoi compiti con autorità, mostrando competenza professionale e sensibilità nelle relazioni. Inoltre, lavora sempre per capire al meglio le situazioni e i problemi che incontra, usando un approccio educativo per rispondere adeguatamente ai bisogni dei cittadini.
Gulpease Index: 48 | Flesch Vacca Index: 32 | NVdB: 84% | Passive: 0% | Semantic Similarity: 97% | Edit distance: 48%

GPT-3.5-Turbo
L’operatore di Polizia Locale è un punto di riferimento importante per la comunità. Nell’esercizio delle sue funzioni, svolge i compiti con autorevolezza, dimostrando professionalità e sensibilità nei rapporti con le persone. Si impegna costantemente a comprendere al meglio le situazioni e le problematiche affrontate, adottando un approccio educativo e orientato a rispondere in modo adeguato ai bisogni dei cittadini.
Gulpease Index: 45 | Flesch Vacca Index: 27 | NVdB: 78% | Passive: 0% | Semantic Similarity: 98% | Edit distance: 45%

LLaMA 3
L’operatore di Polizia Locale è un punto di riferimento importante per la comunità. Esegue i suoi compiti con autorità, professionalità e sensibilità nel rapporto con la gente. La sua attività è caratterizzata dal costante impegno per comprendere meglio le situazioni e i problemi, e per rispondere in modo educativo ai bisogni dei cittadini, con un approccio orientato alla loro assistenza.
Gulpease Index: 50 | Flesch Vacca Index: 37 | NVdB: 85% | Passive: 28% | Semantic Similarity: 96% | Edit distance: 54%

Phi 3
L’operatore di Polizia Locale è un punto di riferimento importante per la comunità. Esegue i suoi compiti con autorità, professionalità e sensibilità nel trattare le persone. La sua attività è guidata dal desiderio di capire meglio le situazioni e le problematiche, e di rispondere in modo appropriato ai bisogni dei cittadini, con un approccio educativo.
Gulpease Index: 52 | Flesch Vacca Index: 38 | NVdB: 82% | Passive: 28% | Semantic Similarity: 96% | Edit distance: 56%
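Two of the surface metrics in Table 8 are straightforward to reproduce. The sketch below implements the Gulpease index, 89 + (300 · sentences − 10 · letters) / words, following Lucisano and Piemontese, and a character-level Levenshtein edit distance normalised by the longer string. The tokenisation and the choice of normalisation are simplifying assumptions, not the paper’s exact preprocessing:

```python
import re

def gulpease(text: str) -> float:
    # Gulpease readability: 89 + (300 * sentences - 10 * letters) / words.
    # Word and sentence splitting here is deliberately crude.
    words = re.findall(r"\w+", text)
    letters = sum(len(w) for w in words)
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    return 89 + (300 * sentences - 10 * letters) / len(words)

def norm_edit_distance(a: str, b: str) -> float:
    # Levenshtein distance via dynamic programming, then scaled to [0, 1]
    # by dividing by the length of the longer string.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1] / max(len(a), len(b), 1)

print(norm_edit_distance("gatto", "gatti"))  # -> 0.2
```

Higher Gulpease means easier text, which matches the pattern in Table 8: every simplified version scores above the original’s 38.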