<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Ptihsao,r.Italy ating the accountability of LLMs in text simplification
† These authors contributed equally. and on assessing the metrics employed to measure the
$ marco.russodivito@unimol.it (M. Russodivito); quality of LLM text simplicfiation [</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>AI vs. Human: Effectiveness of LLMs in Simplifying Italian Administrative Documents</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Marco Russodivito</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Vittorio Ganfi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giuliana Fiorentino</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Rocco Oliveto</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Molise</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <volume>12</volume>
      <issue>13</issue>
      <fpage>0000</fpage>
      <lpage>0002</lpage>
      <abstract>
        <p>This study investigates the effectiveness of Large Language Models (LLMs) in simplifying Italian administrative texts compared to human informants. This research evaluates the performance of several well-known LLMs, including GPT-3.5-Turbo, GPT-4, LLaMA 3, and Phi 3, in simplifying a corpus of Italian administrative documents (s-ItaIst), a representative sample of Italian administrative texts. To accurately compare the simplification abilities of humans and LLMs, six parallel corpora of a subsection of ItaIst are collected. These parallel corpora were analyzed using both complexity and similarity metrics to assess the outcomes of LLMs and human participants. Our findings indicate that while LLMs perform comparably to humans in many aspects, there are notable differences in structural and semantic changes. The results of our study underscore the potential and limitations of using AI for administrative text simplification, highlighting areas where LLMs need improvement to achieve human-level proficiency.</p>
      </abstract>
      <kwd-group>
        <kwd>Automatic Text Simplification</kwd>
        <kwd>Large Language Models</kwd>
        <kwd>Italian Administrative language</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Due to the increasing popularity of generative
Artificial Intelligence (AI) language tools [
        <xref ref-type="bibr" rid="ref1">1, 2</xref>
        ], significant
attention has been devoted to the use of LLMs for text
simplification [3]. Several studies have addressed the
application of LLMs to simplify texts, particularly focusing
on administrative documents, including those in Italian
[4, 5, 6]. Italian administrative texts are often notably
complex and obscure [7, 8, 9], which restricts a large
segment of the population from fully accessing the content
produced by the Italian public administration [10, 11].
      </p>
      <p>This work aims to (a) evaluate the quality of automatic
text simplification performed by several well-known
LLMs, and (b) compare LLM-based simplification with
human-based simplification. To address these research
questions, the following procedures were undertaken:</p>
      <sec id="sec-1-1">
        <title>1. From an empirical perspective, a large corpus of Italian administrative texts was collected (i.e., ItaIst).</title>
        <p>A parallel simplified counterpart of the corpus was created using different LLMs. Additionally, a shorter version of the administrative corpus was manually simplified by two annotators.</p>
      </sec>
      <sec id="sec-1-2">
        <title>2. From an analytical perspective, several statistical analyses were conducted to measure the semantic and complexity closeness between human and LLM-generated data.</title>
        <p>The comparison of scores for both LLM and human datasets highlights significant differences and similarities between manual and AI-driven simplification.</p>
        <p>The results concerning readability indices (e.g., Gulpease) and semantic and structural similarities (e.g., edit distance) reveal that LLMs generally perform comparably to human informants. However, AI-simplified texts are slightly less similar to the original documents than those generated by human simplifiers. LLMs tend to introduce more changes in the simplified corpora than human annotators. The empirical study indicates that texts simplified by AI exhibit more structural and lexical dissimilarities from the original documents than those simplified by humans.</p>
        <p>Replication package. All the code and data are available on Figshare at https://figshare.com/s/4d927fe648c6f1cb4227.</p>
        <p>… by comparing parallel corpora of simplified documents and adopting a qualitative interpretative approach [17]. Other contributions have assessed the outputs of LLMs in simplification tasks, particularly focusing on models partially trained on Italian [18].</p>
        <p>Our paper analyzes the differences between LLM and human simplification of Italian administrative texts, following a quantitative approach. By examining these differences, our study aims to highlight the similarities and dissimilarities that emerge during the simplification of administrative documents by humans and AI.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>3. Study Design</title>
      <p>Our study aims to analyze the effectiveness of modern LLMs in simplifying administrative text. To achieve this, we address the following Research Question (RQ): How effective are AI systems at simplifying administrative texts compared to humans? This question evaluates whether modern AI can achieve a level of quality comparable to that of human experts, our references, by analyzing how well LLMs can reduce complexity while preserving the original meaning of the texts.</p>
      <p>The study has been conducted on a sub-corpus of ItaIst, utilizing several LLMs to support the text simplification process.</p>
      <p>3.1. Corpus</p>
      <p>The ItaIst corpus has been created as part of the VerbACxSS research project. It was composed by linguists and jurists to create a representative linguistic resource for contemporary administrative Italian [19, 20]. ItaIst was assembled by collecting recent official documents from local and regional public administration websites of eight Italian regions (Basilicata, Calabria, Campania, Lazio, Lombardy, Molise, Tuscany, and Veneto), covering topics such as garbage, healthcare, and public services. The corpus includes a variety of text types, such as Tender Notices, Planning Acts, and Service Charters.</p>
      <p>The reliability of the corpus design was ensured by (a) linguists, who checked that the corpus represents administrative Italian in terms of textual and diatopic features, and (b) jurists, who selected and validated each document included in ItaIst. The resulting corpus, comprising 208 documents, consists of around 2,000,000 tokens and 45,000 types1. More information about the ItaIst corpus can be found in Appendix A.</p>
      <p>To make a fair comparison between humans and AI, a sub-corpus of ItaIst (hereinafter, s-ItaIst) was extracted. The s-ItaIst sub-corpus was composed by selecting representative documents from each region, balancing the topics and text types of the main corpus. Table 1 provides a summary of the s-ItaIst corpus.</p>
      <p>Table 1: An overview of the main metrics of the s-ItaIst corpus. # documents: 8; # sentences: 1,314; # tokens: 33,295; # types: 5,622.</p>
      <p>3.2. LLMs</p>
      <p>To investigate both open-source and commercial models, the s-ItaIst corpus was simplified using four distinct LLMs, namely GPT-3.5-Turbo [21] and GPT-4 [22] by OpenAI, LLaMA 3 [23] by Meta, and Phi 3 [23] by Microsoft. For the open-source models, we used the LLaMA 3 8B2 and Phi 3 3.8B3 variants, both fine-tuned on large Italian corpora. This selection explores models of various sizes while ensuring optimal performance on Italian tasks.</p>
      <p>A detailed prompt was formulated to instruct each model to perform the simplification task properly, avoiding summarization and applying state-of-the-art simplification rules [9]. The full prompt can be found in Appendix B. The OpenAI models were accessed via APIs4, while the open-source models were hosted on an AWS EC2 G65 instance equipped with a single Nvidia L4 GPU with 24GB vRAM.</p>
      <p>3.3. Experimental Procedure</p>
      <p>To address our research question, we conducted an empirical study to compare automatic and manual simplifications. Our study, illustrated in Figure 1, can be summarized in three main steps: (i) constructing a corpus of administrative documents (i.e., s-ItaIst), (ii) simplifying this corpus using four LLMs and two human annotators, and (iii) comparing the LLM-simplified corpora with the human-simplified corpora.</p>
      <p>It is worth noting that the s-ItaIst corpus was subdivided into small sections (2-6 sentences) to avoid exceeding the context windows of the LLMs and to facilitate human informants during simplification6.</p>
      <p>1https://huggingface.co/datasets/VerbACxSS/ItaIst
2https://huggingface.co/DeepMount00/Llama-3-8b-Ita (last seen 07-21-2024)
3https://huggingface.co/e-palmisano/Phi3-ITA-mini-4K-instruct (last seen 07-21-2024)
4https://openai.com/api/ (last seen 07-21-2024)
5https://aws.amazon.com/it/ec2/instance-types/g6/ (last seen 07-21-2024)
6The s-ItaIst corpus was segmented into a total of 619 sections of text. Each section, then, was assigned to human annotators and LLMs for simplification.</p>
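<p>To illustrate this setup, here is a minimal sketch of how a chat-completion request for one corpus section might be assembled. The instruction wording, model name, and temperature below are hypothetical placeholders, not the study's actual configuration; the real prompt is reported in Appendix B.</p>

```python
def build_simplification_request(section: str) -> dict:
    """Chat-completion payload for simplifying one s-ItaIst section.

    The Italian instruction below is a hypothetical stand-in for the
    study's actual prompt (see Appendix B).
    """
    system = (
        "Semplifica il seguente testo amministrativo italiano. "
        "Non riassumere: conserva tutte le informazioni e il significato "
        "originale, applicando regole di semplificazione lessicale e sintattica."
    )
    return {
        "model": "gpt-4",   # the study also ran gpt-3.5-turbo and two HF models
        "temperature": 0,   # assumption: deterministic decoding
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": section},
        ],
    }
```

<p>The same payload shape can be reused for the other models by swapping the model field.</p>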
      <p>Figure 1: Overview of the experimental procedure. The s-ItaIst corpus undergoes manual simplification (Human1, Human2) and automatic simplification (GPT-4, GPT-3.5-Turbo, LLaMA 3, Phi 3), producing six parallel corpora; complexity metrics (Gulpease Index, Flesch-Vacca Index, NVdB %, Passive verbs %) and similarity metrics (Semantic Similarity %, Edit Distance %) are then extracted from them.</p>
      <sec id="sec-2-2">
        <title>3.4. Metrics</title>
        <p>Human annotators with strong backgrounds in linguistics and deep knowledge of administrative text simplification simplified the corpus following common simplification rules identified in the literature [24, 25, 8, 9]. They used a custom web application that (i) assigned sections of the document to simplify and (ii) tracked the time they spent on the activity. Similarly, each LLM was instructed to automatically simplify every document in the corpus one section at a time.</p>
        <p>This approach provided a comprehensive comparison dataset of six distinct parallel corpora. We analyzed these data to compare human and automatic simplifications by extracting features such as complexity and similarity metrics, measuring the quality of the simplified texts and their relatedness to the original text. Furthermore, we computed the Wilcoxon Signed-Rank Test [26] to statistically evaluate the difference between LLM and human metrics, and Cliff's Delta [27, 28] to provide a measure of the effect size.</p>
        <p>In the literature, several simplicity measures (for instance, SAMSA [29] and SARI [30]) are employed, although their results may vary depending on the level of analysis examined and, of course, on the design of the metrics. SAMSA aims to measure structural simplicity by monitoring sentence-splitting accuracy, while SARI was developed to measure the simplicity advantage when just lexical paraphrasing is evaluated. Furthermore, some studies show that, when calculated using multi-operation manual references, both a generic metric like BLEU [31] and an operation-specific one like SARI have low associations with assessments of overall simplicity [32]. Thus, to measure the readability of the investigated corpora we selected: 1. the Flesch Vacca Index, Gulpease Index, and READ-IT, since they are advanced instruments designed to investigate the degree of simplicity of Italian texts, and 2. percentages of some lexical and structural features (i.e., the amount of most common lexical items and active verb forms) that increase the readability of texts.</p>
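<p>Cliff's Delta, used alongside the Wilcoxon Signed-Rank Test (available, for instance, as scipy.stats.wilcoxon), has a direct pairwise definition that can be sketched in a few lines. The magnitude thresholds below follow the common convention from the literature and are an assumption about how effect sizes are labeled here:</p>

```python
def cliffs_delta(xs, ys):
    """Cliff's Delta effect size, in [-1, 1]:
    (#pairs where x exceeds y minus #pairs where y exceeds x) / (n * m)."""
    gt = sum(1 for x in xs for y in ys if x > y)
    lt = sum(1 for x in xs for y in ys if y > x)
    return (gt - lt) / (len(xs) * len(ys))

def magnitude(d):
    """Label |d| using the common thresholds (0.147, 0.33, 0.474)."""
    a = abs(d)
    if a >= 0.474:
        return "large"
    if a >= 0.33:
        return "medium"
    if a >= 0.147:
        return "small"
    return "negligible"
```
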
        <p>To assess the quality of the simplifications, we employed
both complexity and similarity metrics from the
literature. Complexity metrics compare the ease of the original
and simplified text, while similarity metrics measure the
distance between them. We implemented these metrics
according to the state-of-the-art, leveraging natural
language processing (NLP) techniques (e.g., tokenization,
POS tagging7).</p>
        <sec id="sec-2-2-1">
          <p>7The process of tokenization and tagging was conducted using the spaCy natural language processing tool: https://spacy.io (last seen 07-21-2024).</p>
          <p>Also for similarity metrics, the computational literature offers several resources aiming to measure the structural or semantic proximity of texts. Some of these operate at the level of n-gram overlap (e.g., BLEU [31] and METEOR [33]), while others consider other features. For this analysis, we selected Semantic Similarity to quantify the degree of semantic closeness between corpora and Edit distance to measure structural similarities between the investigated corpora.</p>
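<p>Both selected similarity metrics reduce to short computations once texts are embedded or aligned; a minimal stdlib sketch follows. In the study the embedding vectors come from a multilingual sentence-transformers model; here they are represented as plain lists of floats, an assumption made to keep the sketch self-contained:</p>

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors
    (in the study, sentence embeddings; here, plain float lists)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def edit_distance_pct(a: str, b: str) -> float:
    """Levenshtein distance between two texts, normalized by the length
    of the longer one and expressed as a percentage (0 = identical)."""
    m, n = len(a), len(b)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution
        prev = curr
    return 100 * prev[n] / max(m, n, 1)
```
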
          <p>To support future research, we have made our metrics
implementation publicly available8.</p>
          <p>Details concerning the considered complexity metrics are shown herein:
• Gulpease Index [34]: This metric evaluates the readability of an Italian text and assesses the education level required to fully comprehend it. It is calculated using the following formula: 89 + (300 · #sentences − 10 · #letters) / #words (1).
• Flesch-Vacca Index: The Italian adaptation of the Flesch Reading Ease formula, calculated as 217 − 1.3 · (#words / #sentences) − 0.6 · (#syllables · 100 / #words) (2).
• READ-IT [36]: The tool is the first advanced readability evaluation instrument for Italian, combining traditional raw text features with lexical, morpho-syntactic, and syntactic information. Four different readability models are included in the tool: READ-IT BASE includes only raw features, calculating sentence length (average number of words per sentence) and word length (average number of characters per word); READ-IT LEXICAL combines raw (e.g., word length) and lexical (e.g., Type/Token Ratio) features; READ-IT SYNTACTIC employs raw text (e.g., sentence length) and morpho-syntactic (e.g., average number of clauses per sentence) properties; READ-IT GLOBAL includes all the other features, combining raw text, lexical, morpho-syntactic and syntactic (e.g., the depth of the whole parse tree) features9.
• NVdB (%): "Il Nuovo vocabolario di base della lingua italiana" [37] consists of fundamental and commonly used words representing the essential lexicon of the Italian language. The ease of a text can be roughly estimated by the number of words listed in the basic vocabulary [38].
• Passive (%): Overuse of passive voice can lead to ambiguity and complexity, especially for readers who may struggle with comprehension [24, 25, 9]. It is calculated by identifying verbs with aux:pass occurring in the Dependency Parsing Tree.</p>
          <p>Details concerning the considered similarity metrics are shown herein:
• Semantic Similarity (%) [39]: This metric measures the distance between the semantic meanings of two documents. It can be computed exploiting relevant methodologies from the literature, such as BERTscore [40] and SBERT [41]. We opted for the latter approach, which leverages cosine similarity between contextual embeddings (obtained through sentence-transformers and an open-source multilingual model10) to evaluate similarity at the sentence level, encapsulating the overall contextual meaning [42].
• Edit distance (%) [43]: This metric measures the similarity between two strings based on the number of single-character edits (insertions, deletions, or substitutions) required to transform one text into the other. A value close to zero indicates a relatively minor difference between the two texts, while a high value indicates significant rephrasing.</p>
        </sec>
        <sec id="sec-2-2-2">
          <title>3.5. Threats to validity</title>
          <p>We analyze the validity of our study by examining construct, internal, and external validity. This evaluation helps us understand the strengths and limitations of our methodology and the generalizability of our findings.</p>
          <p>Construct validity: The two linguistic experts involved in the manual simplification of the s-ItaIst corpus may have produced divergent variants due to their subjective approaches. Despite differences in seniority, both experts have strong linguistic backgrounds (holding PhDs) and several years of experience. Nevertheless, involving two human simplifiers allowed us to explore distinct simplification approaches and compare automatic simplification against two varied benchmarks.</p>
          <p>Internal validity: The LLMs used for automatic text simplification, particularly those from HuggingFace, may have been trained on non-administrative texts, potentially introducing issues in the simplified text. However, we relied on state-of-the-art models tested against several benchmarks [44, 45, 46, 47]. Additionally, the embeddings for calculating Semantic Similarity were obtained through a multilingual model chosen for its high ranking on the MTEB leaderboard11, particularly for its performance in the STS22 benchmark (it) [48].</p>
          <p>External validity: Our study focuses on the sub-corpus of ItaIst, consisting of eight administrative documents. Although the number of documents is relatively small, the corpus includes over 1,000 sentences. Manual simplification of the corpus took Human1 and Human2 15 and 23 hours, respectively. Extending our study to the entire ItaIst corpus would have been infeasible. However, the documents of the ItaIst sub-corpus were not chosen randomly; they were selected to represent the variety of administrative texts.</p>
        </sec>
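<p>The Gulpease Index described in Section 3.4 depends only on letter, word, and sentence counts; a rough stdlib sketch follows. The counting heuristics below, especially the regex-based sentence splitting, are simplifications of what a full NLP pipeline such as spaCy provides:</p>

```python
import re

def gulpease(text: str) -> float:
    """Gulpease readability index:
    89 + (300 * #sentences - 10 * #letters) / #words.
    Higher values indicate easier text."""
    words = re.findall(r"[A-Za-zÀ-ÿ]+", text)
    letters = sum(len(w) for w in words)
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    n_words = max(1, len(words))
    return 89 + (300 * sentences - 10 * letters) / n_words
```
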
      </sec>
    </sec>
    <sec id="sec-3">
      <title>4. Results and Discussion</title>
      <sec id="sec-3-1">
        <p>A preliminary analysis of our results, summarized in Table 2, reveals several significant similarities and differences between the human and LLM datasets. For instance, the variation in the number of tokens is similar across both human and LLM corpora, although LLMs generally increase the number of sentences more prominently than human annotators.</p>
        <p>Regarding complexity metrics, all the parallel corpora (both human and LLM) exhibit a general increase in readability compared to the original texts. For example, the majority of the corpora improve the Gulpease Index readability metric, shifting the difficulty level from very difficult to difficult for middle school reading levels [34] (except for Human1 and GPT-3.5-Turbo). Additionally, complexity metrics vary similarly across both human and LLM groups, with differences between manual and AI simplifiers not significantly greater than those between Human1 and Human2 or among GPT-3.5-Turbo, GPT-4, LLaMA 3, and Phi 3.</p>
        <p>The analysis of semantic and structural distance metrics from the original s-ItaIst shows more pronounced differences between human and LLM datasets. In terms of semantic similarity (Semantic Similarity), the Human1 and Human2 corpora are closer to the original meaning than the LLM-simplified corpora. These differences are even more pronounced when considering edit distance (Edit distance). The percentage of edit distance is higher in the LLM group, with each LLM corpus exceeding the human ones by at least 10%.</p>
        <p>Higher degrees of Semantic Similarity and lower degrees of Edit distance in human corpora indicate that human annotators tend to make fewer changes to the original text compared to LLMs.</p>
        <p>As reported in Table 2, GPT-4 achieved the best results across the majority of metrics (except for READ-IT LEXICAL). To validate our outcomes, we performed the Wilcoxon Signed-Rank Test and calculated Cliff's Delta effect size to analyze the difference between GPT-4 and human metrics. By examining the results in Table 3, we can assert that: GPT-4 simplifications can be comparable to human simplifications. GPT-4 simplifications are negligibly better for complexity metrics, moderately worse for similarity, and largely rephrased compared to human simplifications. The results of the Wilcoxon Signed-Rank Test and Cliff's Delta effect size for the other models, though not fully significant, are listed in Appendix C.</p>
        <p>A brief extract taken from the Original, Human1, Human2, and GPT-4 parallel corpora, representing the same phrase simplified by the two human annotators and GPT-4, is shown below12:
Original: fatturato minimo annuo, per gli ultimi tre esercizi, pari o superiore al valore stimato del presente appalto
Human1: Guadagno in un anno (fatturato minimo annuo) negli ultimi 3 anni di valore uguale o superiore al valore di questo bando
Human2: l'ammontare di fatture emesse annualmente, per gli ultimi tre anni, deve essere pari o superiore al valore stimato del presente appalto
GPT-4: un fatturato annuo minimo, negli ultimi tre anni, uguale o maggiore al valore stimato dell'appalto
12A more extensive example of data regarding human and LLM simplifications collected in the parallel corpora designed for this study can be found in Appendix D.</p>
        <p>Table 3: Results of the Wilcoxon Signed-Rank Test and Cliff's Delta effect size performed on GPT-4, Human1, and Human2 metrics.
Human1. Gulpease Index: p &lt; 0.0001, negligible ↗; Flesch Vacca Index: p &lt; 0.0001, negligible ↗; NVdB: p = 0.0108, negligible ↗; Passive: p = 0.0004, negligible ↘; READ-IT BASE: p &lt; 0.0001, small ↘; READ-IT LEXICAL: p &lt; 0.0001, negligible ↗; READ-IT SYNTACTIC: p &lt; 0.0001, small ↘; READ-IT GLOBAL: —; Semantic Similarity: —; Edit distance: —.
Human2. Gulpease Index: —; Flesch Vacca Index: —; NVdB: —; Passive: p &lt; 0.0001, negligible ↘; READ-IT BASE: p = 0.0292, negligible ↗; READ-IT LEXICAL: —; READ-IT SYNTACTIC: p &lt; 0.0001, negligible ↘; READ-IT GLOBAL: p &lt; 0.0001, negligible ↘; Semantic Similarity: p &lt; 0.0001, medium ↘; Edit distance: p &lt; 0.0001, large ↗.</p>
        <p>Despite this limitation, LLMs can serve as valuable support tools for text simplification, significantly accelerating a process that typically requires hours of manual work. By generating initial drafts, LLMs can reduce the workload of human experts, who would then review and refine the AI-generated drafts, ensuring the preservation of the overall meaning and legal integrity of the text. The results achieved in our study indicated that modern LLMs can simplify administrative documents almost as effectively as humans. However, the achieved findings […] automatically simplified documents. A manual investigation of our parallel corpus, supervised by expert jurists, may reveal important implications in this sensitive context.</p>
        <p>Another promising direction for future research is to investigate the impact of automatic simplification on text comprehension. An additional empirical study could be designed to evaluate whether automatically simplified documents are easier to understand than their original versions.</p>
        <p>In the above syntagmas, the similarities between the simplifications are quite obvious: for example, the technical term esercizio or the more ambiguous word pari are replaced by the more common lexical equivalents anno or uguale, respectively.</p>
        <p>Additionally, it would be worthwhile to explore different prompting strategies to further improve simplification quality. For instance, few-shot prompting [50] with some manually simplified gold samples could better align LLMs with human style.</p>
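<p>Few-shot prompting of this kind can be sketched by interleaving gold human simplifications as prior chat turns. The system instruction and example pairs below are hypothetical placeholders, not the study's actual prompt or data:</p>

```python
def build_few_shot_messages(gold_pairs, section):
    """Chat messages with manually simplified gold samples as examples.

    gold_pairs: list of (original, simplified) tuples; hypothetical
    stand-ins for manually simplified s-ItaIst sections.
    """
    messages = [{
        "role": "system",
        "content": "Semplifica il testo amministrativo italiano senza riassumerlo.",
    }]
    for original, simplified in gold_pairs:
        # each gold pair becomes a worked example in the conversation
        messages.append({"role": "user", "content": original})
        messages.append({"role": "assistant", "content": simplified})
    messages.append({"role": "user", "content": section})
    return messages
```
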
      </sec>
    </sec>
    <sec id="sec-4">
      <title>5. Conclusion</title>
      <sec id="sec-4-1">
        <title>In this study, we investigated the automatic simplification of Italian administrative documents.</title>
        <p>Our results demonstrate that LLMs can effectively simplify these texts, performing comparably to humans13.</p>
        <p>Among the models examined, GPT-4 shows superior performance in text simplification, exhibiting significant improvements in complexity metrics. Nonetheless, it is noteworthy that humans tend to maintain a higher Semantic Similarity and a lower Edit distance, ensuring the preservation of the original meaning and structure of the text. In other words, humans, aware of the importance of precise language for these documents, mostly preserved the original meaning and structure, whereas LLMs, while simplifying, tended to rephrase extensively. This rephrasing, although effective in reducing complexity, might inadvertently alter the legal nuances, which are critical in administrative texts.
13Further evidence showing that LLM simplifications preserve the meaning of the original texts was obtained in a study conducted on the same data. The unpublished research indicated that experienced evaluators, i.e., jurists having administrative competence, agree that LLM simplifications of administrative texts maintain the legal integrity of the original documents [49].</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <sec id="sec-5-1">
        <title>This contribution is a result of the research conducted within the framework of the PRIN 2020 (Progetti di Rilevante Interesse Nazionale) project “VerbACxSS: on analytic verbs, complexity, synthetic verbs, and simplification. For accessibility” (Prot. 2020BJKB9M), funded by the Italian Ministry of Universities and Research.</title>
        <p>Giuliana Fiorentino and Rocco Oliveto are responsible for research question identification, study design, research supervision, and data analysis. However, for academic reasons, Section 2, Section 3.1, Section 3.3, Section 4, and Section 5 are attributed to Vittorio Ganfi; and Section 1, Section 3, Section 3.2, Section 3.4, and Section 3.5 to Marco Russodivito.</p>
        <p>References
[2] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, J. Davison, S. Shleifer, P. von Platen, C. Ma, Y. Jernite, J. Plu, C. Xu, T. Le Scao, S. Gugger, M. Drame, Q. Lhoest, A. Rush, Transformers: State-of-the-art natural language processing, in: Conference on Empirical Methods in Natural Language Processing: System Demonstrations (EMNLP), 2020, pp. 38–45.
[3] M. J. Ryan, T. Naous, W. Xu, Revisiting non-English text simplification: A unified multilingual benchmark, Association for Computational Linguistics (ACL) (2023).
[4] D. Brunato, F. Dell'Orletta, G. Venturi, S. Montemagni, Design and Annotation of the First Italian Corpus for Text Simplification, in: Linguistic Annotation Workshop (LAW), 2015, pp. 31–41.
[5] M. Miliani, S. Auriemma, F. Alva-Manchego, A. Lenci, Neural readability pairwise ranking for sentences in Italian administrative language, in: Asia-Pacific Chapter of the Association for Computational Linguistics (AACL) and International Joint Conference on Natural Language Processing (IJCNLP), 2022, pp. 849–866.
[6] M. Miliani, M. S. Senaldi, G. Lebani, A. Lenci, Understanding Italian Administrative Texts: A Reader-Oriented Study for Readability Assessment and Text Simplification, in: Workshop on AI for Public Administration (AIxPA), 2022, pp. 71–87.
[7] S. Lubello, La lingua del diritto e dell'amministrazione, Il mulino, Bologna, 2017.
[8] M. Cortelazzo, Il linguaggio amministrativo. Principi e pratiche di modernizzazione, Carocci, Roma, 2021.
[9] G. Fiorentino, V. Ganfi, Parametri per semplificare l'italiano istituzionale: Revisione della letteratura, Italiano LinguaDue 16 (2024) 220–237.
[10] E. Piemontese (Ed.), Il dovere costituzionale di farsi capire. A trent'anni dal Codice di stile, Carocci, Roma, 2023.
[11] S. Lubello, Da dembsher al codice di stile e oltre: un bilancio sul linguaggio burocratico, in: E. Piemontese (Ed.), Il dovere costituzionale di farsi capire. A trent'anni dal Codice di stile, Carocci, Roma, 2023, pp. 54–70.
[12] G. Gonzalez Delgado, B. Navarro Colorado, The Simplification of the Language of Public Administration: The Case of Ombudsman Institutions, in: Proceedings of the Workshop on DeTermIt! Evaluating Text Difficulty in a Multilingual Context, 2024, pp. 125–133.
[13] R. Doshi, K. Amin, P. Khosla, S. Bajaj, S. Chheang, H. P. Forman, Utilizing large Language Models to Simplify Radiology Reports: a comparative analysis of ChatGPT3.5, ChatGPT4.0, Google Bard, and Microsoft Bing, medRxiv (2023). doi:10.1101/2023.06.04.23290786.
[14] P. Mavrepis, G. Makridis, G. Fatouros, V. Koukos, M. Separdani, D. Kyriazis, Xai for all: Can large language models simplify explainable ai?, arXiv preprint arXiv:2401.13110 (2024).
[15] Y. Ma, S. Seneviratne, E. Daskalaki, Improving Text Simplification with Factuality Error Detection, in: Workshop on Text Simplification, Accessibility, and Readability (TSAR), 2022, pp. 173–178.
[16] F. Alva-Manchego, C. Scarton, L. Specia, Data-Driven Sentence Simplification: Survey and Benchmark, Computational Linguistics 46 (2020) 135–187.
[17] M. Miliani, F. Alva-Manchego, A. Lenci, Simplifying Administrative Texts for Italian L2 Readers with Controllable Transformers Models: A Data-driven Approach, in: CLiC-it, 2023.
[18] D. Nozza, G. Attanasio, et al., Is it really that simple? Prompting language models for automatic text simplification in Italian, in: CEUR Workshop Proceedings, 2023.
[19] D. Vellutino, et al., L'italiano istituzionale per la comunicazione pubblica, Il mulino, Bologna, 2018.
[20] D. Vellutino, N. Cirillo, Corpus «itaist»: Note per lo sviluppo di una risorsa linguistica per lo studio dell'italiano istituzionale per il diritto di accesso civico, Italiano LinguaDue 16 (2024) 238–250.
[21] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al., Language models are few-shot learners, Advances in Neural Information Processing Systems (NIPS) 33 (2020) 1877–1901.
[22] J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat, et al., Gpt-4 technical report, arXiv preprint arXiv:2303.08774 (2023).
[23] AI@Meta, Llama 3 model card (2024). URL: https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md.
[24] E. Piemontese, Criteri e proposte di semplificazione, in: Codice di stile delle comunicazioni scritte a uso delle pubbliche amministrazioni, Istituto Poligrafico e Zecca dello Stato, Roma, 1994.
[25] A. Fioritto, Manuale di stile. Strumenti per semplificare il linguaggio delle amministrazioni pubbliche, Il mulino, Bologna, 1997.
[26] F. Wilcoxon, Probability tables for individual comparisons by ranking methods, Biometrics 3 (1947) 119–122.
[27] N. Cliff, Dominance statistics: Ordinal analyses to answer ordinal questions, Psychological Bulletin 114 (1993) 494–509.
[28] N. Cliff, Ordinal methods for behavioral data analysis, Psychology Press, New York, 2014.
[29] E. Sulem, O. Abend, A. Rappoport, Semantic structural evaluation for text simplification, in: M. Walker, H. Ji, A. Stent (Eds.), Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), Association for Computational Linguistics, New Orleans, Louisiana, 2018, pp. 685–696. URL: https://aclanthology.org/N18-1063. doi:10.18653/v1/N18-1063.
[30] W. Xu, C. Napoles, E. Pavlick, Q. Chen, C. Callison-Burch, Optimizing Statistical Machine Translation for Text Simplification, Transactions of the Association for Computational Linguistics 4 (2016) 401–415. URL: https://doi.org/10.1162/tacl_a_00107. doi:10.1162/tacl_a_00107.
[31] K. Papineni, S. Roukos, T. Ward, W.-J. Zhu, Bleu: a method for automatic evaluation of machine translation, in: Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, ACL '02, Association for Computational Linguis-
[…] doi:10.3389/fpsyg.2022.707630.
[39] D. Chandrasekaran, V. Mago, Evolution of semantic similarity—A survey, ACM Computing Surveys (CSUR) 54 (2021) 1–37.
[40] T. Zhang, V. Kishore, F. Wu, K. Q. Weinberger, Y. Artzi, Bertscore: Evaluating text generation with bert, in: International Conference on Learning Representations, 2020. URL: https://openreview.net/forum?id=SkeHuCVFDr.
[41] N. Reimers, I. Gurevych, Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks, in: Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics, 2019.
[42] A. Barayan, J. Camacho-Collados, F. Alva-Manchego, Analysing zero-shot readability-controlled sentence simplification, arXiv preprint arXiv:2409.20246 (2024).
[43] F. P. Miller, A. F. Vandome, J. McBrewster, Levenshtein distance: Information theory, computer
tics, USA, 2002, p. 311–318. URL: https://doi.org/ science, string (computer science), string metric,
10.3115/1073083.1073135. doi:10.3115/1073083. damerau? Levenshtein distance, spell checker,
ham1073135. ming distance, Alpha Press, Olando, 2009.
[32] F. Alva-Manchego, C. Scarton, L. Specia, The [44] D. Hendrycks, C. Burns, S. Basart, A. Zou,
(Un)Suitability of Automatic Evaluation Metrics for M. Mazeika, D. Song, J. Steinhardt, Measuring
Text Simplification, Computational Linguistics 47 massive multitask language understanding,
Inter(2021) 861–889. URL: https://doi.org/10.1162/coli_ national Conference on Learning Representations
a_00418. doi:10.1162/coli_a_00418. (ICLR) (2021).
[33] S. Banerjee, A. Lavie, Meteor: An automatic metric [45] R. Zellers, A. Holtzman, Y. Bisk, A. Farhadi, Y. Choi,
for mt evaluation with improved correlation with Hellaswag: Can a machine really finish your
senhuman judgments, in: Workshop on Intrinsic and tence?, in: Proceedings of the 57th Annual Meeting
Extrinsic Evaluation Measures for Machine Trans- of the Association for Computational Linguistics,
lation and/or Summarization, 2005, pp. 65–72. 2019, p. 4791–4800.
[34] P. Lucisano, M. E. Piemontese, Gulpease: una for- [46] P. Clark, I. Cowhey, O. Etzioni, T. Khot, A.
Sabmula per la predizione della leggibilita di testi in harwal, C. Schoenick, O. Tafjord, Think you have
lingua italiana, Scuola e città (1988) 110–124. solved question answering? try arc, the ai2
rea[35] V. Franchina, R. Vacca, Adaptation of flesh readabil- soning challenge, arXiv preprint arXiv:1803.05457
ity index on a bilingual text written by the same (2018).
author both in italian and english languages, Lin- [47] D. Dua, Y. Wang, P. Dasigi, G. Stanovsky, S. Singh,
guaggi 3 (1986) 47–49. M. Gardner, Drop: A reading comprehension
bench[36] F. Dell’Orletta, S. Montemagni, G. Venturi, Read–it: mark requiring discrete reasoning over paragraphs,
Assessing readability of italian texts with a view to in: J. Burstein, C. Doran, T. Solorio (Eds.),
Proceedtext simplification, in: Proceedings of the second ings of the 2019 Conference of the North American
workshop on speech and language processing for Chapter of the Association for Computational
Linassistive technologies, 2011, pp. 73–83. guistics: Human Language Technologies, Volume 1
[37] T. De Mauro, I. Chiari, Il nuovo vo- (Long and Short Papers), 2019, pp. 2368–2378.
cabolario di base della lingua italiana [48] N. Muennighof, N. Tazi, L. Magne, N. Reimers,
(2016). URL: https://www.internazionale. MTEB: Massive text embedding benchmark, in:
it/opinione/tullio-de-mauro/2016/12/23/ European Chapter of the Association for
Computail-nuovo-vocabolario-di-base-della-lingua-italiana. tional Linguistics (EACL), 2023, pp. 2014–2037.
[38] D. Brunato, F. Dell’Orletta, G. Venturi, [49] G. Fiorentino, M. Russodivito, V. Ganfi, R. Oliveto,
Linguistically-Based Comparison of Difer- Validazione e confronto tra semplificazione
autoent Approaches to Building Corpora for matica e semplificazione manuale di testi in italiano
Text Simplification: A Case Study on Ital- istituzionale ai fini dell’eficacia comunicativa, in:
ian, Frontiers in Psychology 13 (2022). Automated texts In the ROMance languages and
beadvances of few-shot learning methods and
applications, Science China Technological Sciences 66
(2023) 920–944.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>A. Corpus ItaIst</title>
      <p>The ItaIst corpus is a comprehensive collection of Italian
administrative documents. Table 4 provides an overview
of the topics and regions from which these documents
were collected. This corpus has been assembled to
represent the diversity and complexity of contemporary
administrative Italian, ensuring its relevance for linguistic
and computational analysis.</p>
    </sec>
    <sec id="sec-7">
      <title>B. Prompt engineering</title>
      <sec id="sec-7-1">
        <p>In the context of LLMs, the term prompt refers to the instructions provided to a language model to generate a specific response. Prompt engineering is the process of designing a clear and detailed prompt that instructs the model to generate the desired response. The prompt we used to ask the models to simplify administrative text is:</p>
        <p>Sei un dipendente pubblico che deve scrivere dei documenti istituzionali italiani per renderli semplici e comprensibili per i cittadini. Ti verrà fornito un documento pubblico e il tuo compito sarà quello di riscriverlo applicando regole di semplificazione senza però modificare il significato del documento originale. Ad esempio potresti rendere le frasi più brevi, eliminare le perifrasi, esplicitare sempre il soggetto, utilizzare parole più semplici, trasformare i verbi passivi in verbi di forma attiva, spostare le frasi parentetiche alla fine del periodo.</p>
        <p>(In English: You are a public employee who has to write Italian institutional documents so that they are simple and understandable for citizens. You will be given a public document, and your task will be to rewrite it by applying simplification rules without changing the meaning of the original document. For example, you could make the sentences shorter, remove circumlocutions, always make the subject explicit, use simpler words, turn passive verbs into active ones, and move parenthetical clauses to the end of the sentence.)</p>
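        <p>As a sketch of how a fixed system prompt of this kind can be paired with each document before it is sent to a chat-based LLM; the message-role layout below follows the common OpenAI-style chat format and is an assumption for illustration, not the paper's exact setup:</p>
        <preformat>
```python
# Sketch: assembling the simplification request for a chat-style LLM.
# The system message carries the (Italian) simplification instructions;
# the user message carries the administrative document to simplify.
# The role layout is an assumption, not the study's documented pipeline.

SIMPLIFICATION_PROMPT = (
    "Sei un dipendente pubblico che deve scrivere dei documenti "
    "istituzionali italiani per renderli semplici e comprensibili "
    "per i cittadini. ..."  # full prompt as given above
)

def build_simplification_messages(document: str) -> list[dict]:
    """Return an OpenAI-style message list for one document."""
    return [
        {"role": "system", "content": SIMPLIFICATION_PROMPT},
        {"role": "user", "content": document},
    ]

messages = build_simplification_messages("L'operatore di Polizia Locale ...")
```
        </preformat>
        <p>Keeping the instructions in the system message and only the document in the user message makes it easy to reuse one prompt across all models and documents.</p>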
      </sec>
    </sec>
    <sec id="sec-9">
      <title>C. Tests</title>
      <p>Table 5, Table 6, and Table 7 report the results of the statistical analyses conducted to compare the simplification performance of various LLMs against human experts. The Wilcoxon Signed-Rank Test and Cliff's Delta effect size were employed to evaluate the metrics of the GPT-3.5-Turbo, LLaMA 3, and Phi 3 models in comparison to two human simplifiers, labelled Human1 and Human2. These analyses provide insights into the relative effectiveness of AI-driven simplifications versus human efforts.</p>
    </sec>
    <sec id="sec-8">
      <title>D. Examples</title>
      <p>Table 8 provides several examples of text simplification. For each example, we present the original text alongside its simplified versions. The values of the complexity and similarity metrics are reported for each text.</p>
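      <p>The Cliff's Delta effect size used in the statistical tests of Appendix C can be sketched in a few lines (the Wilcoxon signed-rank test itself is usually taken from a library such as scipy.stats.wilcoxon); the scores below are illustrative numbers, not the paper's data:</p>
      <preformat>
```python
# Sketch of Cliff's delta, the ordinal effect size reported alongside the
# Wilcoxon signed-rank test. Values range from -1 to 1; 0 means the two
# groups' values fully overlap.

def cliffs_delta(xs: list, ys: list) -> float:
    """delta = (#pairs where x wins - #pairs where y wins) / (n * m)."""
    x_wins = sum(1 for x in xs for y in ys if x > y)
    y_wins = sum(1 for x in xs for y in ys if y > x)
    return (x_wins - y_wins) / (len(xs) * len(ys))

# Toy comparison: Gulpease scores of one model vs. one human simplifier
# (illustrative values only). Every human score exceeds every model score,
# so the delta is -1.0 (complete separation in the human's favour).
model_scores = [48, 45, 50, 52]
human_scores = [55, 58, 53, 57]
print(cliffs_delta(model_scores, human_scores))  # prints -1.0
```
      </preformat>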
      <p>Table 8 – Example simplifications with complexity and similarity metrics (Gulpease Index; Flesch Vacca Index; NVdB %; Passive %; Semantic Similarity %; Edit distance %).</p>
      <p>Original: L'operatore di Polizia Locale, quindi, rappresenta un importante punto di riferimento per la collettività. Nell'ambito delle sue funzioni, esso svolge i propri compiti in maniera autorevole, dando prova di preparazione professionale e sensibilità nel contatto relazionale. La sua attività, inoltre, è caratterizzata dal costante sforzo teso alla migliore interpretazione delle situazioni e delle problematiche incontrate, applicando un approccio educativo e orientato alla più adeguata risposta ai bisogni della cittadinanza.</p>
      <p>Gulpease Index: 38; Flesch Vacca Index: 12; NVdB: 77 %; Passive: 28 %; Semantic Similarity: –; Edit distance: –</p>
      <p>Human 1: La Polizia Locale è un punto di riferimento per i cittadini. La Polizia Locale ha autorevolezza, professionalità e sensibilità nel contatto con i cittadini. La Polizia Locale cerca sempre di interpretare al meglio situazioni e problematiche incontrate. La Polizia Locale si comporta in modo da educare e rispondere adeguatamente ai bisogni dei cittadini.</p>
      <p>Gulpease Index: 55; Flesch Vacca Index: 33; NVdB: 67 %; Passive: 0 %; Semantic Similarity: 93 %; Edit distance: 56 %</p>
      <p>Human 2: L'operatore di Polizia Locale, quindi, è un importante punto di riferimento per la collettività. Quando è in servizio, esso svolge i propri compiti in maniera autorevole. Dimostra preparazione professionale e sensibilità nel contatto con gli utenti. Cerca sempre di interpretare le situazioni e i problemi nel migliore dei modi. Applica un approccio educativo e cerca di adeguarsi ai bisogni della cittadinanza.</p>
      <p>Gulpease Index: 58; Flesch Vacca Index: 42; NVdB: 83 %; Passive: 0 %; Semantic Similarity: 98 %; Edit distance: 35 %</p>
      <p>GPT-4: L'operatore di Polizia Locale è un punto di riferimento importante per la comunità. Svolge i suoi compiti con autorità, mostrando competenza professionale e sensibilità nelle relazioni. Inoltre, lavora sempre per capire al meglio le situazioni e i problemi che incontra, usando un approccio educativo per rispondere adeguatamente ai bisogni dei cittadini.</p>
      <p>Gulpease Index: 48; Flesch Vacca Index: 32; NVdB: 84 %; Passive: 0 %; Semantic Similarity: 97 %; Edit distance: 48 %</p>
      <p>GPT-3.5-Turbo: L'operatore di Polizia Locale è un punto di riferimento importante per la comunità. Nell'esercizio delle sue funzioni, svolge i compiti con autorevolezza, dimostrando professionalità e sensibilità nei rapporti con le persone. Si impegna costantemente a comprendere al meglio le situazioni e le problematiche affrontate, adottando un approccio educativo e orientato a rispondere in modo adeguato ai bisogni dei cittadini.</p>
      <p>Gulpease Index: 45; Flesch Vacca Index: 27; NVdB: 78 %; Passive: 0 %; Semantic Similarity: 98 %; Edit distance: 45 %</p>
      <p>LLaMA 3: L'operatore di Polizia Locale è un punto di riferimento importante per la comunità. Esegue i suoi compiti con autorità, professionalità e sensibilità nel rapporto con la gente. La sua attività è caratterizzata dal costante impegno per comprendere meglio le situazioni e i problemi, e per rispondere in modo educativo ai bisogni dei cittadini, con un approccio orientato alla loro assistenza.</p>
      <p>Gulpease Index: 50; Flesch Vacca Index: 37; NVdB: 85 %; Passive: 28 %; Semantic Similarity: 96 %; Edit distance: 54 %</p>
      <p>Phi 3: L'operatore di Polizia Locale è un punto di riferimento importante per la comunità. Esegue i suoi compiti con autorità, professionalità e sensibilità nel trattare le persone. La sua attività è guidata dal desiderio di capire meglio le situazioni e le problematiche, e di rispondere in modo appropriato ai bisogni dei cittadini, con un approccio educativo.</p>
      <p>Gulpease Index: 52; Flesch Vacca Index: 38; NVdB: 82 %; Passive: 28 %; Semantic Similarity: 96 %; Edit distance: 56 %</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Vaswani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Shazeer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Parmar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Uszkoreit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. N.</given-names>
            <surname>Gomez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Kaiser</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Polosukhin</surname>
          </string-name>
          ,
          <article-title>Attention is all you need</article-title>
          ,
          <source>in: Advances in Neural Information Processing Systems (NIPS)</source>
          , volume
          <volume>30</volume>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>