AI vs. Human: Effectiveness of LLMs in Simplifying Italian Administrative Documents

Marco Russodivito1,†, Vittorio Ganfi1,*,†, Giuliana Fiorentino1 and Rocco Oliveto1
1 University of Molise, Italy

Abstract
This study investigates the effectiveness of Large Language Models (LLMs) in simplifying Italian administrative texts compared to human informants. It evaluates the performance of several well-known LLMs, namely GPT-3.5-Turbo, GPT-4, LLaMA 3, and Phi 3, on s-ItaIst, a representative corpus of Italian administrative documents. To accurately compare the simplification abilities of humans and LLMs, six parallel corpora of a subsection of ItaIst were collected. These parallel corpora were analyzed using both complexity and similarity metrics to assess the outcomes of LLMs and human participants. Our findings indicate that while LLMs perform comparably to humans in many aspects, there are notable differences in structural and semantic changes. The results of our study underscore the potential and limitations of using AI for administrative text simplification, highlighting areas where LLMs need improvement to achieve human-level proficiency.

Keywords
Automatic Text Simplification, Large Language Models, Italian Administrative Language

1. Introduction

Due to the increasing popularity of generative Artificial Intelligence (AI) language tools [1, 2], significant attention has been devoted to the use of LLMs for text simplification [3]. Several studies have addressed the application of LLMs to simplify texts, particularly focusing on administrative documents, including those in Italian [4, 5, 6]. Italian administrative texts are often notably complex and obscure [7, 8, 9], which restricts a large segment of the population from fully accessing the content produced by the Italian public administration [10, 11].

This work aims to (a) evaluate the quality of automatic text simplification performed by several well-known LLMs, and (b) compare LLM-based simplification with human-based simplification. To address these research questions, the following procedures were undertaken:

1. From an empirical perspective, a large corpus of Italian administrative texts was collected (i.e., ItaIst). A parallel simplified counterpart of the corpus was created using different LLMs. Additionally, a shorter version of the administrative corpus was manually simplified by two annotators.
2. From an analytical perspective, several statistical analyses were conducted to measure the semantic and complexity closeness between human and LLM-generated data. The comparison of scores for both LLM and human datasets highlights significant differences and similarities between manual and AI-driven simplification.

The results concerning readability indexes (e.g., Gulpease) and semantic and structural similarities (e.g., edit distance) reveal that LLMs generally perform comparably to human informants. However, AI-simplified texts are slightly less similar to the original documents than those generated by human simplifiers. LLMs tend to introduce more changes in the simplified corpora than human annotators. The empirical study indicates that texts simplified by AI exhibit more structural and lexical dissimilarities from the original documents than those simplified by humans.

Replication package. All the code and data are available on Figshare at https://figshare.com/s/4d927fe648c6f1cb4227.

2. Related Work

Several researchers have conducted research on evaluating the accountability of LLMs in text simplification and on assessing the metrics employed to measure the quality of LLM text simplification [12, 13, 14, 15, 16].

CLiC-it 2024: Tenth Italian Conference on Computational Linguistics, Dec 04-06, 2024, Pisa, Italy
* Corresponding author.
† These authors contributed equally.
marco.russodivito@unimol.it (M. Russodivito);
vittorio.ganfi@unimol.it (V. Ganfi); giuliana.fiorentino@unimol.it (G. Fiorentino); rocco.oliveto@unimol.it (R. Oliveto)
ORCID: 0009-0004-8860-1739 (M. Russodivito); 0000-0002-0892-7287 (V. Ganfi); 0000-0002-0392-9056 (G. Fiorentino); 0000-0002-7995-8582 (R. Oliveto)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073.

In particular, numerous studies have focused on assessing the use of LLMs to simplify Italian administrative texts, highlighting the potential of these models to enhance text readability. Some studies have specifically evaluated the readability of simplified administrative texts by comparing parallel corpora of simplified documents and adopting a qualitative interpretative approach [17]. Other contributions have assessed the outputs of LLMs in simplification tasks, particularly focusing on models partially trained on Italian [18].

Our paper analyzes the differences between LLM and human simplification of Italian administrative texts, following a quantitative approach. By examining these differences, our study aims to highlight the similarities and dissimilarities that emerge during the simplification of administrative documents by humans and AI.

3. Study Design

Our study aims to analyze the effectiveness of modern LLMs in simplifying administrative text. To achieve this, we address the following Research Question (RQ):

How effective are AI systems at simplifying administrative texts compared to humans?

This question evaluates whether modern AI can achieve a level of quality comparable to that of human experts, our reference, by analyzing how well LLMs can reduce complexity while preserving the original meaning of the texts. The study has been conducted on a sub-corpus of ItaIst, utilizing several LLMs to support the text simplification process.

3.1. Corpus

The ItaIst corpus has been created as part of the VerbACxSS research project. It was composed by linguists and jurists to create a representative linguistic resource for contemporary administrative Italian [19, 20]. ItaIst was assembled by collecting recent official documents from local and regional public administration websites of eight Italian regions (Basilicata, Calabria, Campania, Lazio, Lombardy, Molise, Tuscany, and Veneto), covering topics such as garbage, healthcare, and public services. The corpus includes a variety of text types, such as Tender Notices, Planning Acts, and Service Charters.

The reliability of the corpus design was ensured by (a) linguists, who checked that the corpus represents administrative Italian in terms of textual and diatopic features, and (b) jurists, who selected and validated each document included in ItaIst. The resulting corpus, comprising 208 documents, consists of around 2,000,000 tokens and 45,000 types1. More information about the ItaIst corpus can be found in Appendix A.

To make a fair comparison between humans and AI, a sub-corpus of ItaIst (hereinafter, s-ItaIst) was extracted. The s-ItaIst sub-corpus was composed by selecting representative documents from each region, balancing the topics and text types of the main corpus. Table 1 provides a summary of s-ItaIst.

Table 1
An overview of the main metrics of the s-ItaIst corpus.

Metrics        Value
# documents    8
# sentences    1,314
# tokens       33,295
# types        5,622

3.2. LLMs

To investigate both open-source and commercial models, the s-ItaIst corpus was simplified using four distinct LLMs, namely GPT-3.5-Turbo [21] and GPT-4 [22] by OpenAI, LLaMA 3 [23] by Meta, and Phi 3 by Microsoft. For the open-source models, we used the LLaMA 3 8B2 and Phi 3 3.8B3 variants, both fine-tuned on large Italian corpora. This selection explores models of various sizes while ensuring optimal performance on Italian tasks.

A detailed prompt was formulated to instruct each model to perform the simplification task properly, avoiding summarization and applying state-of-the-art simplification rules [9]. The full prompt can be found in Appendix B.

The OpenAI models were accessed via APIs4, while the open-source models were hosted on an AWS EC2 G65 instance equipped with a single Nvidia L4 GPU with 24GB vRAM.

3.3. Experimental Procedure

To address our research question, we conducted an empirical study to compare automatic and manual simplifications. Our study, illustrated in Figure 1, can be summarized in three main steps: (i) constructing a corpus of administrative documents (i.e., s-ItaIst), (ii) simplifying this corpus using four LLMs and two human annotators, and (iii) comparing the LLM-simplified corpora with the human-simplified corpora.

It is worth noting that the s-ItaIst corpus was subdivided into small sections (2-6 sentences) to avoid exceeding the context windows of the LLMs and to facilitate human informants during simplification6.

1 https://huggingface.co/datasets/VerbACxSS/ItaIst
2 https://huggingface.co/DeepMount00/Llama-3-8b-Ita (last seen 07-21-2024)
3 https://huggingface.co/e-palmisano/Phi3-ITA-mini-4K-instruct (last seen 07-21-2024)
4 https://openai.com/api/ (last seen 07-21-2024)
5 https://aws.amazon.com/it/ec2/instance-types/g6/ (last seen 07-21-2024)
6 The s-ItaIst corpus was segmented into a total of 619 sections of text. Each section was then assigned to the human annotators and LLMs for simplification.
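The sectioning step described above can be sketched as follows. This is a minimal illustration under stated assumptions: a naive period-based splitter stands in for the spaCy pipeline the authors actually used for tokenization, and the section size is a parameter (the paper used sections of 2-6 sentences; the value 2 below is only for the toy example).

```python
import re

def split_sentences(text: str) -> list[str]:
    # Naive sentence splitter for illustration only;
    # the paper relies on spaCy for tokenization and segmentation.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def make_sections(text: str, max_sentences: int = 4) -> list[str]:
    """Group consecutive sentences into short sections so each
    section stays well within an LLM's context window."""
    sentences = split_sentences(text)
    return [
        " ".join(sentences[i:i + max_sentences])
        for i in range(0, len(sentences), max_sentences)
    ]

doc = "Prima frase. Seconda frase. Terza frase. Quarta frase. Quinta frase."
sections = make_sections(doc, max_sentences=2)
```

Each resulting section can then be handed independently to an annotator or sent as one prompt to an LLM, which is how the 619 sections mentioned in footnote 6 would be produced at corpus scale.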
Figure 1: Experimental design schema: The s-ItaIst corpus was simplified both automatically and manually by two humans and four LLMs. The resulting parallel corpora were analyzed using complexity and similarity metrics.

Human annotators with strong backgrounds in linguistics and deep knowledge of administrative text simplification simplified the corpus following common simplification rules identified in the literature [24, 25, 8, 9]. They exploited a custom web application that (i) assigned sections of the document to simplify and (ii) tracked the time they spent during such an activity. Similarly, each LLM was instructed to automatically simplify every document in the corpus one section at a time.

This approach provided a comprehensive comparison dataset of six distinct parallel corpora. We analyzed these data to compare human and automatic simplifications by extracting features such as complexity and similarity metrics to measure the quality of the simplified texts and their relatedness to the original text. Furthermore, we computed the Wilcoxon Signed-Rank Test [26] to statistically evaluate the difference between LLM and human metrics and Cliff's Delta [27, 28] to provide a measure of the effect size.

3.4. Metrics

To assess the quality of the simplifications, we employed both complexity and similarity metrics from the literature. Complexity metrics compare the ease of the original and simplified text, while similarity metrics measure the distance between them. We implemented these metrics according to the state-of-the-art, leveraging natural language processing (NLP) techniques (e.g., tokenization, POS tagging7).

In the literature, several simplicity measures (for instance, SAMSA [29] and SARI [30]) are employed, although their results may vary depending on the level of analysis examined and, of course, on the design of the metrics. Specifically, SAMSA aims to measure structural simplicity by monitoring sentence splitting accuracy, while SARI was developed to measure the simplicity advantage when just lexical paraphrasing is evaluated. Furthermore, some studies show that, when calculated using multi-operation manual references, both a generic metric like BLEU [31] and an operation-specific one like SARI have low associations with assessments of overall simplicity [32]. Thus, to measure the readability of the investigated corpora we selected:

1. the Flesch Vacca Index, the Gulpease Index and READ-IT, since they are advanced instruments designed to investigate the degree of simplicity of Italian texts, and
2. percentages of some lexical and structural features (i.e., the amount of most common lexical items and active verb forms) increasing the readability of texts.

Also for similarity metrics, the computational literature offers several resources aiming to measure the structural or semantic proximity of texts. Some of these operate at the level of n-gram overlap (e.g., BLEU [31] and METEOR [33]), while others consider other features. For this analysis, we selected Semantic Similarity to quantify the degree of semantic closeness between corpora and Edit distance to measure structural similarities between the investigated corpora.

To support future research, we have made our metrics implementation publicly available8.

7 The process of tokenization and tagging was conducted using the spaCy natural language processing tool: https://spacy.io (last seen 07-21-2024)
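The effect-size computation mentioned above (Cliff's Delta [27, 28]) can be sketched as follows. This is a minimal, dependency-free illustration: the p-value of the paired comparison would come from a signed-rank test (e.g., scipy.stats.wilcoxon), omitted here, and the magnitude thresholds (0.147/0.33/0.474) are the conventional ones from the effect-size literature, an assumption of this sketch rather than a detail stated in the paper.

```python
def cliffs_delta(xs: list[float], ys: list[float]) -> float:
    """Cliff's Delta: (#pairs with x > y - #pairs with x < y) / (n * m)."""
    gt = sum(1 for x in xs for y in ys if x > y)
    lt = sum(1 for x in xs for y in ys if x < y)
    return (gt - lt) / (len(xs) * len(ys))

def magnitude(d: float) -> str:
    # Conventional interpretation thresholds; assumed here for illustration.
    ad = abs(d)
    if ad < 0.147:
        return "negligible"
    if ad < 0.33:
        return "small"
    if ad < 0.474:
        return "medium"
    return "large"

# Toy example: hypothetical per-section Gulpease scores for two simplifiers.
human = [49.0, 50.1, 48.7, 51.2]
llm = [51.5, 52.0, 50.9, 53.1]
d = cliffs_delta(llm, human)
```

A positive delta means the first group tends to score higher; the sign and magnitude together correspond to the arrow and label pairs (e.g., "negligible ↗") reported in Table 3.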
The complexity metrics considered are detailed below:

• Gulpease Index [34]: This metric evaluates the readability of an Italian text and assesses the education level required to fully comprehend it. It is calculated using the following formula:

89 + (300 · sentences − 10 · characters) / tokens    (1)

• Flesch Vacca Index [35]: This is an adaptation of the original Flesch Reading Ease formula for evaluating the readability of Italian texts, computed as follows:

217 − 130 · (syllables / tokens) − (tokens / sentences)    (2)

• READ-IT [36]: This tool is the first advanced readability evaluation instrument for Italian, combining traditional raw text features with lexical, morpho-syntactic, and syntactic information. Four different readability models are included in the tool: READ-IT BASE includes only raw features, calculating sentence length (average number of words per sentence) and word length (average number of characters per word); READ-IT LEXICAL combines raw (e.g., word length) and lexical (e.g., Type/Token Ratio) features; READ-IT SYNTACTIC employs raw text (e.g., sentence length) and morpho-syntactic (e.g., average number of clauses per sentence) properties; READ-IT GLOBAL includes all other features, combining raw text, lexical, morpho-syntactic and syntactic (e.g., the depth of the whole parse tree) features9.

• NVdB (%): "Il Nuovo vocabolario di base della lingua italiana" [37] consists of fundamental and commonly used words representing the essential lexicon of the Italian language. The ease of a text can be roughly estimated by the number of its words listed in the basic vocabulary [38].

• Passive (%): Overuse of the passive voice can lead to ambiguity and complexity, especially for readers who may struggle with comprehension [24, 25, 9]. It is calculated by identifying verbs with an aux:pass auxiliary in the Dependency Parsing Tree.

The similarity metrics considered are detailed below:

• Semantic Similarity (%) [39]: This metric measures the distance between the semantic meanings of two documents. It can be computed exploiting relevant methodologies from the literature, such as BERTscore [40] and SBERT [41]. We opted for the latter approach, which leverages cosine similarity between contextual embeddings (obtained through sentence-transformers and an open-source multilingual model10) to evaluate similarity at the sentence level, encapsulating the overall contextual meaning [42].

• Edit distance (%) [43]: This metric measures the similarity between two strings based on the number of single-character edits (insertions, deletions, or substitutions) required to transform one text into the other. A value close to zero indicates a relatively minor difference between the two texts, while a high value indicates significant rephrasing.

3.5. Threats to validity

We analyze the validity of our study by examining construct, internal, and external validity. This evaluation helps us understand the strengths and limitations of our methodology and the generalizability of our findings.

Construct validity: The two linguistic experts involved in the manual simplification of the s-ItaIst corpus may have produced divergent variants due to their subjective approaches. Despite differences in seniority, both experts have strong linguistic backgrounds (holding PhDs) and several years of experience. Nevertheless, involving two human simplifiers allowed us to explore distinct simplification approaches and compare automatic simplification against two varied benchmarks.

Internal validity: The LLMs used for automatic text simplification, particularly those from HuggingFace, may have been trained on non-administrative texts, potentially introducing issues in the simplified text. However, we relied on state-of-the-art models tested against several benchmarks [44, 45, 46, 47]. Additionally, the embeddings for calculating Semantic Similarity were obtained through a multilingual model chosen for its high ranking on the MTEB leaderboard11, particularly for its performance in the STS22 benchmark (it) [48].

External validity: Our study focuses on the sub-corpus of ItaIst, consisting of eight administrative documents. Although the number of documents is relatively small, the corpus includes over 1,000 sentences. Manual simplification of the corpus took Human1 and Human2 15 and 23 hours respectively. Extending our study to the entire ItaIst corpus would have been infeasible. However, the documents of the ItaIst sub-corpus were not chosen randomly; they were selected to represent the variety of administrative texts.

8 https://pypi.org/project/italian-ats-evaluator (last seen 07-21-2024)
9 http://www.italianlp.it/demo/read-it (last seen 04-10-2024)
10 https://huggingface.co/intfloat/multilingual-e5-base (last seen 07-21-2024)
11 https://huggingface.co/spaces/mteb/leaderboard (last seen 07-21-2024)

Table 2
Metrics evaluated across the original corpus and the human and LLM simplified corpora.
                           Original   Human1   Human2   GPT-3.5-Turbo   GPT-4    LLaMA 3   Phi 3
Tokens                     33,295     34,135   29,755   30,032          31,722   36,035    36,056
Sentences                  1,314      1,506    1,744    1,515           1,840    1,944     1,900
Tokens per Sentence        25.33      22.66    17.06    19.53           17.24    18.53     18.97
Sentences per Document     164.25     188.25   218.00   189.37          230.00   243.00    237.50
Gulpease Index             44.31      49.72    50.64    48.49           51.34    50.26     50.16
Flesch Vacca Index         19.97      34.23    33.63    30.33           36.75    34.09     33.75
NVdB (%)                   73.28      80.44    76.89    78.28           81.07    80.18     80.16
Passive (%)                20.87      15.78    17.71    13.99           12.00    15.81     15.72
READ-IT BASE (%)           75.91      68.62    51.00    66.61           55.00    58.37     57.69
READ-IT LEXICAL (%)        93.64      85.37    89.71    91.96           90.29    77.13     75.74
READ-IT SYNTACTIC (%)      63.72      53.14    40.09    38.42           29.92    40.97     41.24
READ-IT GLOBAL (%)         86.48      69.24    61.34    68.69           54.60    59.26     58.37
Semantic Similarity (%)    -          96.52    97.26    96.06           95.80    94.96     94.96
Edit distance (%)          -          35.84    29.20    49.21           52.14    55.48     55.44

4. Results and Discussion

A preliminary analysis of our results, summarized in Table 2, reveals several significant similarities and differences between the human and LLM datasets. For instance, the variation in the number of tokens is similar across both human and LLM corpora, although LLMs generally increase the number of sentences more prominently than human annotators.

Regarding complexity metrics, all the parallel corpora (both human and LLM) exhibit a general increase in readability compared to the original texts. For example, the majority of the corpora improve the Gulpease Index readability metric, shifting the difficulty level from very difficult to difficult for middle school reading levels [34] (except for Human1 and GPT-3.5-Turbo). Additionally, complexity metrics vary similarly across both human and LLM groups, with differences between manual and AI simplifiers not significantly greater than those between Human1 and Human2 or among GPT-3.5-Turbo, GPT-4, LLaMA 3, and Phi 3.

The analysis of semantic and structural distance metrics from the original s-ItaIst shows more pronounced differences between human and LLM datasets. In terms of semantic similarity (Semantic Similarity), the Human1 and Human2 corpora are closer to the original meaning than the LLM-simplified corpora. These differences are even more pronounced when considering edit distance (Edit distance). The percentage of edit distance is higher in the LLM group, with each LLM corpus exceeding the human ones by at least 10%.

Higher degrees of Semantic Similarity and lower degrees of Edit distance in human corpora indicate that human annotators tend to make fewer changes to the original text compared to LLMs.

As reported in Table 2, GPT-4 achieved the best results across the majority of metrics (except for READ-IT LEXICAL). To validate our outcomes, we performed the Wilcoxon Signed-Rank Test and calculated Cliff's Delta effect size to analyze the difference between GPT-4 and human metrics. By examining the results in Table 3, we can assert that:

GPT-4 simplifications can be comparable to human simplifications. GPT-4 simplifications are negligibly better for complexity metrics, moderately worse for similarity, and largely rephrased compared to human simplifications.

The results of the Wilcoxon Signed-Rank Test and Cliff's Delta Effect Size for the other models, though not fully significant, are listed in Appendix C.

A brief extract taken from the Original, Human1, Human2 and GPT-4 parallel corpora, representing the same phrase simplified by the two human annotators and GPT-4, is shown below12:

Original: fatturato minimo annuo, per gli ultimi tre esercizi, pari o superiore al valore stimato del presente appalto
Human1: Guadagno in un anno (fatturato minimo annuo) negli ultimi 3 anni di valore uguale o superiore al valore di questo bando
Human2: l'ammontare di fatture emesse annualmente, per gli ultimi tre anni, deve essere pari o superiore al valore stimato del presente appalto
GPT-4: un fatturato annuo minimo, negli ultimi tre anni, uguale o maggiore al valore stimato dell'appalto

In the above syntagmas, the similarities between the simplifications are quite obvious: for example, the technical term esercizio and the more ambiguous word pari are replaced by the more common lexical equivalents anno and uguale, respectively.

Table 3
Results of the Wilcoxon Signed-Rank Test and Cliff's Delta Effect Size performed on GPT-4, Human1, and Human2 metrics.

          Metrics                p-value     Effect Size
Human1    Gulpease Index         < 0.0001    negligible ↗
          Flesch Vacca Index     < 0.0001    negligible ↗
          NVdB                   0.0108      negligible ↗
          Passive                0.0004      negligible ↘
          READ-IT BASE           < 0.0001    small ↘
          READ-IT LEXICAL        < 0.0001    negligible ↗
          READ-IT SYNTACTIC      < 0.0001    small ↘
          READ-IT GLOBAL         < 0.0001    small ↘
          Semantic Similarity    < 0.0001    small ↘
          Edit distance          < 0.0001    large ↗
Human2    Gulpease Index         0.0092      negligible ↗
          Flesch Vacca Index     < 0.0001    negligible ↗
          NVdB                   < 0.0001    small ↗
          Passive                < 0.0001    negligible ↘
          READ-IT BASE           0.0292      negligible ↗
          READ-IT LEXICAL        -           -
          READ-IT SYNTACTIC      < 0.0001    negligible ↘
          READ-IT GLOBAL         < 0.0001    negligible ↘
          Semantic Similarity    < 0.0001    medium ↘
          Edit distance          < 0.0001    large ↗

12 A more extensive example of data regarding human and LLM simplifications collected in the parallel corpora designed for this study can be found in Appendix D.

5. Conclusion

In this study, we investigated the automatic simplification of Italian administrative documents. Our results demonstrate that LLMs can effectively simplify these texts, performing comparably to humans13.

Among the models examined, GPT-4 shows superior performance in text simplification, exhibiting significant improvements in complexity metrics. Nonetheless, it is noteworthy that humans tend to maintain a higher level of Semantic Similarity and a lower Edit distance, ensuring the preservation of the original meaning and structure of the text. In other words, humans, aware of the importance of precise language for these documents, mostly preserved the original meaning and structure, whereas LLMs, while simplifying, tended to rephrase extensively. This rephrasing, although effective in reducing complexity, might inadvertently alter the legal nuances, which are critical in administrative texts.

Despite this limitation, LLMs can serve as valuable support tools for text simplification, significantly accelerating a process that typically requires hours of manual work. By generating initial drafts, LLMs can reduce the workload of human experts, who would then review and refine the AI-generated drafts, ensuring the preservation of the overall meaning and legal integrity of the text.

The results achieved in our study indicate that modern LLMs can simplify administrative documents almost as effectively as humans. However, our findings also indicate that LLMs are not fully capable of preserving the semantic meaning of the text, tending to rephrase more extensively than humans. This could introduce legal issues into the simplified text. Further studies could be conducted to evaluate the juridical equivalence of automatically simplified documents. A manual investigation of our parallel corpus, supervised by expert jurists, may reveal important implications in this sensitive context.

Another promising direction for future research is to investigate the impact of automatic simplification on text comprehension. An additional empirical study could be designed to evaluate whether automatically simplified documents are easier to understand than their original versions.

Additionally, it would be worthwhile to explore different prompting strategies to further improve simplification quality. For instance, few-shot prompting [50] with some manually simplified gold samples could better align LLMs with human style.

Acknowledgments

This contribution is a result of the research conducted within the framework of the PRIN 2020 (Progetti di Rilevante Interesse Nazionale) "VerbACxSS: on analytic verbs, complexity, synthetic verbs, and simplification. For accessibility" (Prot. 2020BJKB9M), funded by the Italian Ministry of Universities and Research.

Giuliana Fiorentino and Rocco Oliveto are responsible for research question identification, study design, research supervision and data analysis. However, for academic reasons, Section 2, Section 3.1, Section 3.3, Section 4, and Section 5 are attributed to Vittorio Ganfi; and Section 1, Section 3, Section 3.2, Section 3.4 and Section 3.5 to Marco Russodivito.

13 Further evidence showing that LLM simplifications preserve the meaning of the original texts was obtained in a study conducted on the same data. The unpublished research indicated that experienced evaluators, i.e., jurists with administrative competence, agree that LLM simplifications of administrative texts maintain the legal integrity of the original documents [49].

References

[1] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, I. Polosukhin, Attention is all you need, in: Advances in Neural Information Processing Systems (NIPS), volume 30, 2017.
[2] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, J. Davison, S. Shleifer, P. von Platen, C. Ma, Y. Jernite, J. Plu, C. Xu, T. Le Scao, S. Gugger, M. Drame, Q. Lhoest, A. Rush, Transformers: State-of-the-art natural language processing, in: Conference on Empirical Methods in Natural Language Processing: System Demonstrations (EMNLP), 2020, pp. 38–45.
[3] M. J. Ryan, T. Naous, W. Xu, Revisiting non-English text simplification: A unified multilingual benchmark, Association for Computational Linguistics (ACL) (2023).
[4] D. Brunato, F. Dell'Orletta, G. Venturi, S. Montemagni, Design and Annotation of the First Italian Corpus for Text Simplification, in: Linguistic Annotation Workshop (LAW), 2015, pp. 31–41.
[5] M. Miliani, S. Auriemma, F. Alva-Manchego, A. Lenci, Neural readability pairwise ranking for sentences in Italian administrative language, in: Asia-Pacific Chapter of the Association for Computational Linguistics (AACL) and International Joint Conference on Natural Language Processing (IJCNLP), 2022, pp. 849–866.
[6] M. Miliani, M. S. Senaldi, G. Lebani, A. Lenci, Understanding Italian Administrative Texts: A Reader-Oriented Study for Readability Assessment and Text Simplification, in: Workshop on AI for Public Administration (AIxPA), 2022, pp. 71–87.
[7] S. Lubello, La lingua del diritto e dell'amministrazione, Il mulino, Bologna, 2017.
[8] M. Cortelazzo, Il linguaggio amministrativo. Principi e pratiche di modernizzazione, Carocci, Roma, 2021.
[9] G. Fiorentino, V. Ganfi, Parametri per semplificare l'italiano istituzionale: Revisione della letteratura, Italiano LinguaDue 16 (2024) 220–237.
[10] E. Piemontese (Ed.), Il dovere costituzionale di farsi capire. A trent'anni dal Codice di stile, Carocci, Roma, 2023.
[11] S. Lubello, Da dembsher al codice di stile e oltre: un bilancio sul linguaggio burocratico, in: E. Piemontese (Ed.), Il dovere costituzionale di farsi capire. A trent'anni dal Codice di stile, Carocci, Roma, 2023, pp. 54–70.
[12] G. Gonzalez Delgado, B. Navarro Colorado, The Simplification of the Language of Public Administration: The Case of Ombudsman Institutions, in: Proceedings of the Workshop on DeTermIt! Evaluating Text Difficulty in a Multilingual Context, 2024, pp. 125–133.
[13] R. Doshi, K. Amin, P. Khosla, S. Bajaj, S. Chheang, H. P. Forman, Utilizing large Language Models to Simplify Radiology Reports: a comparative analysis of ChatGPT3.5, ChatGPT4.0, Google Bard, and Microsoft Bing, medRxiv (2023). doi:10.1101/2023.06.04.23290786.
[14] P. Mavrepis, G. Makridis, G. Fatouros, V. Koukos, M. M. Separdani, D. Kyriazis, Xai for all: Can large language models simplify explainable ai?, arXiv preprint arXiv:2401.13110 (2024).
[15] Y. Ma, S. Seneviratne, E. Daskalaki, Improving Text Simplification with Factuality Error Detection, in: Workshop on Text Simplification, Accessibility, and Readability (TSAR), 2022, pp. 173–178.
[16] F. Alva-Manchego, C. Scarton, L. Specia, Data-Driven Sentence Simplification: Survey and Benchmark, Computational Linguistics 46 (2020) 135–187.
[17] M. Miliani, F. Alva-Manchego, A. Lenci, Simplifying Administrative Texts for Italian L2 Readers with Controllable Transformers Models: A Data-driven Approach, in: CLiC-it, 2023.
[18] D. Nozza, G. Attanasio, et al., Is it really that simple? Prompting language models for automatic text simplification in Italian, in: CEUR Workshop Proceedings, 2023.
[19] D. Vellutino, et al., L'italiano istituzionale per la comunicazione pubblica, Il mulino, Bologna, 2018.
[20] D. Vellutino, N. Cirillo, Corpus «ItaIst»: Note per lo sviluppo di una risorsa linguistica per lo studio dell'italiano istituzionale per il diritto di accesso civico, Italiano LinguaDue 16 (2024) 238–250.
[21] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al., Language models are few-shot learners, Advances in Neural Information Processing Systems (NIPS) 33 (2020) 1877–1901.
[22] J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat, et al., Gpt-4 technical report, arXiv preprint arXiv:2303.08774 (2023).
[23] AI@Meta, Llama 3 model card (2024). URL: https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md.
[24] E. Piemontese, Criteri e proposte di semplificazione, in: Codice di stile delle comunicazioni scritte a uso delle pubbliche amministrazioni, Istituto Poligrafico e Zecca dello Stato, Roma, 1994.
[25] A. Fioritto, Manuale di stile. Strumenti per semplificare il linguaggio delle amministrazioni pubbliche, Il mulino, Bologna, 1997.
[26] F. Wilcoxon, Probability tables for individual comparisons by ranking methods, Biometrics 3 (1947) 119–122.
[27] N. Cliff, Dominance statistics: Ordinal analyses to answer ordinal questions, Psychological Bulletin 114 (1993) 494–509.
[28] N. Cliff, Ordinal methods for behavioral data analysis, Psychology Press, New York, 2014.
[29] E. Sulem, O. Abend, A. Rappoport, Semantic structural evaluation for text simplification, in: M. Walker, H. Ji, A. Stent (Eds.), Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), Association for Computational Linguistics, New Orleans, Louisiana, 2018, pp. 685–696. URL: https://aclanthology.org/N18-1063. doi:10.18653/v1/N18-1063.
[30] W. Xu, C. Napoles, E. Pavlick, Q. Chen, C. Callison-Burch, Optimizing Statistical Machine Translation for Text Simplification, Transactions of the Association for Computational Linguistics 4 (2016) 401–415. URL: https://doi.org/10.1162/tacl_a_00107. doi:10.1162/tacl_a_00107.
[31] K. Papineni, S. Roukos, T. Ward, W.-J. Zhu, Bleu: a method for automatic evaluation of machine translation, in: Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, ACL '02, Association for Computational Linguistics, USA, 2002, pp. 311–318. URL: https://doi.org/10.3115/1073083.1073135. doi:10.3115/1073083.1073135.
[32] F. Alva-Manchego, C. Scarton, L. Specia, The (Un)Suitability of Automatic Evaluation Metrics for Text Simplification, Computational Linguistics 47 (2021) 861–889. URL: https://doi.org/10.1162/coli_a_00418. doi:10.1162/coli_a_00418.
[33] S. Banerjee, A. Lavie, Meteor: An automatic metric for mt evaluation with improved correlation with human judgments, in: Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, 2005, pp. 65–72.
[34] P. Lucisano, M. E. Piemontese, Gulpease: una formula per la predizione della leggibilità di testi in lingua italiana, Scuola e città (1988) 110–124.
[35] V. Franchina, R. Vacca, Adaptation of flesh readability index on a bilingual text written by the same author both in italian and english languages, Linguaggi 3 (1986) 47–49.
[36] F. Dell'Orletta, S. Montemagni, G. Venturi, Read-it: Assessing readability of italian texts with a view to text simplification, in: Proceedings of the second workshop on speech and language processing for assistive technologies, 2011, pp. 73–83.
[37] T. De Mauro, I. Chiari, Il nuovo vocabolario di base della lingua italiana (2016). URL: https://www.internazionale.it/opinione/tullio-de-mauro/2016/12/23/il-nuovo-vocabolario-di-base-della-lingua-italiana.
[38] D. [...] doi:10.3389/fpsyg.2022.707630.
[39] D. Chandrasekaran, V. Mago, Evolution of semantic similarity—A survey, ACM Computing Surveys (CSUR) 54 (2021) 1–37.
[40] T. Zhang, V. Kishore, F. Wu, K. Q. Weinberger, Y. Artzi, Bertscore: Evaluating text generation with bert, in: International Conference on Learning Representations, 2020. URL: https://openreview.net/forum?id=SkeHuCVFDr.
[41] N. Reimers, I. Gurevych, Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks, in: Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics, 2019.
[42] A. Barayan, J. Camacho-Collados, F. Alva-Manchego, Analysing zero-shot readability-controlled sentence simplification, arXiv preprint arXiv:2409.20246 (2024).
[43] F. P. Miller, A. F. Vandome, J. McBrewster, Levenshtein distance: Information theory, computer science, string (computer science), string metric, Damerau-Levenshtein distance, spell checker, Hamming distance, Alpha Press, Orlando, 2009.
[44] D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, J. Steinhardt, Measuring massive multitask language understanding, International Conference on Learning Representations (ICLR) (2021).
[45] R. Zellers, A. Holtzman, Y. Bisk, A. Farhadi, Y. Choi, Hellaswag: Can a machine really finish your sentence?, in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019, pp. 4791–4800.
[46] P. Clark, I. Cowhey, O. Etzioni, T. Khot, A. Sabharwal, C. Schoenick, O. Tafjord, Think you have solved question answering? Try arc, the ai2 reasoning challenge, arXiv preprint arXiv:1803.05457 (2018).
[47] D. Dua, Y. Wang, P. Dasigi, G. Stanovsky, S. Singh, M. Gardner, Drop: A reading comprehension benchmark requiring discrete reasoning over paragraphs, in: J. Burstein, C. Doran, T. Solorio (Eds.), Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 2019, pp. 2368–2378.
[48] N. Muennighoff, N. Tazi, L. Magne, N. Reimers, MTEB: Massive text embedding benchmark, in: European Chapter of the Association for Computational Linguistics (EACL), 2023, pp. 2014–2037.
Brunato, F. Dell’Orletta, G. Venturi, [49] G. Fiorentino, M. Russodivito, V. Ganfi, R. Oliveto, Linguistically-Based Comparison of Differ- Validazione e confronto tra semplificazione auto- ent Approaches to Building Corpora for matica e semplificazione manuale di testi in italiano Text Simplification: A Case Study on Ital- istituzionale ai fini dell’efficacia comunicativa, in: ian, Frontiers in Psychology 13 (2022). Automated texts In the ROMance languages and be- yond” (AI-ROM-II), 2nd International Conference, advances of few-shot learning methods and appli- To appear. cations, Science China Technological Sciences 66 [50] J. Wang, K. Liu, Y. Zhang, B. Leng, J. Lu, Recent (2023) 920–944. Table 5 Table 7 Results of the Wilcoxon Signed-Rank Test and Cliff’s Delta Results of the Wilcoxon Signed-Rank Test and Cliff’s Delta Effect Size performed on GPT-3.5-Turbo, Human1, and Human2 Effect Size performed on Phi 3, Human1, and Human2 metrics. metrics. Metrics p-value Effect Size Metrics p-value Effect Size Gulpease Index 0.0134 negligible ↗ Gulpease Index < 0.0001 negligible ↘ Flesch Vacca Index Human1 Flesch Vacca Index < 0.0001 negligible ↘ NVdB Human1 NVdB < 0.0001 negligible ↘ Passive Passive READ-IT BASE < 0.0001 small ↘ READ-IT BASE 0.0052 negligible ↘ READ-IT LEXICAL < 0.0001 negligible ↘ READ-IT LEXICAL < 0.0001 negligible ↗ READ-IT SYNTACTIC < 0.0001 small ↘ READ-IT SYNTACTIC < 0.0001 small ↘ READ-IT GLOBAL < 0.0001 small ↘ READ-IT GLOBAL Semantic Similarity < 0.0001 medium ↘ Semantic Similarity < 0.0001 small ↘ Edit distance < 0.0001 large ↗ Edit distance < 0.0001 medium ↗ Gulpease Index Gulpease Index < 0.0001 small ↘ Flesch Vacca Index Human2 Flesch Vacca Index < 0.0001 negligible ↘ NVdB < 0.0001 small ↗ Human2 NVdB < 0.0001 negligible ↗ Passive Passive 0.0072 negligible ↘ READ-IT BASE < 0.0001 negligible ↗ READ-IT BASE < 0.0001 small ↗ READ-IT LEXICAL < 0.0001 small ↘ READ-IT LEXICAL 0.0091 negligible ↗ READ-IT SYNTACTIC READ-IT SYNTACTIC READ-IT GLOBAL 
READ-IT GLOBAL 0.0003 negligible ↗ Semantic Similarity < 0.0001 large ↘ Semantic Similarity < 0.0001 medium ↘ Edit distance < 0.0001 large ↗ Edit distance < 0.0001 large ↗ Table 6 A. Corpus ItaIst Results of the Wilcoxon Signed-Rank Test and Cliff’s Delta Effect Size performed on LLaMA 3, Human1, and Human2 The ItaIst corpus is a comprehensive collection of Italian metrics. administrative documents. Table 4 provides an overview Metrics p-value Effect Size of the topics and regions from which these documents Gulpease Index 0.0077 negligible ↗ were collected. This corpus has been assembled to rep- Flesch Vacca Index resent the diversity and complexity of contemporary ad- Human1 NVdB ministrative Italian, ensuring its relevance for linguistic Passive and computational analysis. READ-IT BASE < 0.0001 small ↘ READ-IT LEXICAL < 0.0001 negligible ↘ READ-IT SYNTACTIC < 0.0001 small ↘ Table 4 READ-IT GLOBAL < 0.0001 small ↘ Topics and regions of documents collected in ItaIst Semantic Similarity < 0.0001 medium ↘ Garbage Healthcare Public services Edit distance < 0.0001 large ↗ Basilicata 8 3 9 Gulpease Index Calabria 11 5 9 Flesch Vacca Index Campania 14 7 9 Human2 NVdB < 0.0001 small ↗ Lazio 9 3 9 Passive Lombardia 15 3 11 READ-IT BASE < 0.0001 negligible ↗ Molise 10 7 9 READ-IT LEXICAL < 0.0001 small ↘ Toscana 19 4 12 READ-IT SYNTACTIC Veneto 9 5 10 READ-IT GLOBAL Semantic Similarity < 0.0001 large ↘ Edit distance < 0.0001 large ↗ B. Prompt engineering In the context of LLMs, the term prompt refers to the instructions provided to a language model to generate a specific response. Prompt engineering is the process of designing a clear and detailed prompt to instruct the model to generate a desired response. The prompt we used to ask the models to simplify administrative text is: Sei un dipendente pubblico che deve scrivere dei doc- umenti istituzionali italiani per renderli semplici e com- prensibili per i cittadini. 
Ti verrà fornito un documento pubblico e il tuo compito sarà quello di riscriverlo appli- The Wilcoxon Signed-Rank Test and Cliff’s Delta effect size cando regole di semplificazione senza però modificare il were employed to evaluate the metrics of GPT-3.5-Turbo, significato del documento originale. Ad esempio potresti LLaMA 3, and Phi 3 models in comparison to two human rendere le frasi più brevi, eliminare le perifrasi, esplicitare simplifiers, labelled as Human1 and Human2. These anal- sempre il soggetto, utilizzare parole più semplicii, trasfor- yses provide insights into the relative effectiveness of mare i verbi passivi in verbi di forma attiva, spostare le AI-driven simplifications versus human efforts. frasi parentetiche alla fine del periodo. D. Examples C. Tests Table 8 provides several examples of text simplification. Table 5, Table 6, and Table 7 report the results of the For each example, we present the original text alongside statistical analyses conducted to compare the simplifica- its simplified versions. The values of the complexity and tion performance of various LLMs against human experts. similarity metrics are reported for each text. Table 8 Examples of simplifications. L’operatore di Polizia Locale, quindi, rappresenta un importante punto di riferimento per la collettività. Nell’ambito delle Original sue funzioni, esso svolge i propri compiti in maniera autorevole, dando prova di preparazione professionale e sensibilità nel contatto relazionale. La sua attività, inoltre, è caratterizzata dal costante sforzo teso alla migliore interpretazione delle situazioni e delle problematiche incontrate, applicando un approccio educativo e orientato alla più adeguata risposta ai bisogni della cittadinanza. Gulpease Index Flesch Vacca Index NVdB (%) Passive (%) Semantic Similarity (%) Edit distance (%) 38 12 77 % 28 % - - La Polizia Locale è un punto di riferimento per i cittadini. 
La Polizia Locale ha autorevolezza, professionalità e sensibilità Human1 nel contatto con i cittadini. La Polizia Locale cerca sempre di interpretare al meglio situazioni e problematiche incontrate. La Polizia Locale si comporta in modo da educare e rispondere adeguatamente ai bisogni dei cittadini. Gulpease Index Flesch Vacca Index NVdB (%) Passive (%) Semantic Similarity (%) Edit distance (%) 55 33 67 % 0% 93 % 56 % L’operatore di Polizia Locale, quindi, è un importante punto di riferimento per la collettività. Quando è in servizio, esso Human2 svolge i propri compiti in maniera autorevole. Dimostra preparazione professionale e sensibilità nel contatto con gli utenti. Cerca sempre di interpretare le situazioni e i problemi nel migliore dei modi. Applica un approccio educativo e cerca di adeguarsi ai bisogni della cittadinanza. Gulpease Index Flesch Vacca Index NVdB (%) Passive (%) Semantic Similarity (%) Edit distance (%) 58 42 83 % 0% 98 % 35 % L’operatore di Polizia Locale è un punto di riferimento importante per la comunità. Svolge i suoi compiti con autorità, GPT-4 mostrando competenza professionale e sensibilità nelle relazioni. Inoltre, lavora sempre per capire al meglio le situazioni e i problemi che incontra, usando un approccio educativo per rispondere adeguatamente ai bisogni dei cittadini. Gulpease Index Flesch Vacca Index NVdB (%) Passive (%) Semantic Similarity (%) Edit distance (%) 48 32 84 % 0% 97 % 48 % L’operatore di Polizia Locale è un punto di riferimento importante per la comunità. Nell’esercizio delle sue funzioni, GPT-3.5-Turbo svolge i compiti con autorevolezza, dimostrando professionalità e sensibilità nei rapporti con le persone. Si impegna costantemente a comprendere al meglio le situazioni e le problematiche affrontate, adottando un approccio educativo e orientato a rispondere in modo adeguato ai bisogni dei cittadini. 
Gulpease Index Flesch Vacca Index NVdB (%) Passive (%) Semantic Similarity (%) Edit distance (%) 45 27 78 % 0% 98 % 45 % L’operatore di Polizia Locale è un punto di riferimento importante per la comunità. Esegue i suoi compiti con autorità, LLaMA 3 professionalità e sensibilità nel rapporto con la gente. La sua attività è caratterizzata dal costante impegno per comprendere meglio le situazioni e i problemi, e per rispondere in modo educativo ai bisogni dei cittadini, con un approccio orientato alla loro assistenza. Gulpease Index Flesch Vacca Index NVdB (%) Passive (%) Semantic Similarity (%) Edit distance (%) 50 37 85 % 28 % 96 % 54 % L’operatore di Polizia Locale è un punto di riferimento importante per la comunità. Esegue i suoi compiti con autorità, Phi 3 professionalità e sensibilità nel trattare le persone. La sua attività è guidata dal desiderio di capire meglio le situazioni e le problematiche, e di rispondere in modo appropriato ai bisogni dei cittadini, con un approccio educativo. Gulpease Index Flesch Vacca Index NVdB (%) Passive (%) Semantic Similarity (%) Edit distance (%) 52 38 82 % 28 % 96 % 56 %
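The Cliff's Delta effect sizes reported in Tables 5, 6, and 7 can be reproduced from two samples of per-document metric scores. The sketch below is a minimal pure-Python implementation, not the paper's own code: the function names and the magnitude thresholds (0.147 / 0.33 / 0.474, the conventional cut-offs) are our assumptions, and the sign of the delta presumably corresponds to the ↗/↘ arrows in the tables.

```python
def cliffs_delta(xs, ys):
    """Cliff's delta: P(x > y) - P(x < y) over all pairs (x, y)."""
    n, m = len(xs), len(ys)
    gt = sum(1 for x in xs for y in ys if x > y)
    lt = sum(1 for x in xs for y in ys if x < y)
    return (gt - lt) / (n * m)

def magnitude(d):
    # Conventional interpretation thresholds (assumed here,
    # following the cut-offs commonly attributed to Romano et al.).
    ad = abs(d)
    if ad < 0.147:
        return "negligible"
    if ad < 0.330:
        return "small"
    if ad < 0.474:
        return "medium"
    return "large"
```

In practice the effect size would be paired with a significance test on the same paired samples, such as `scipy.stats.wilcoxon`, which gives the p-values reported alongside it in the tables.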
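Several of the complexity metrics in Table 8 are simple functions of surface counts. For instance, the Gulpease index [34] for Italian is defined as 89 + (300 * sentences - 10 * letters) / words, with higher scores indicating easier text. The sketch below is a rough illustration only: the regex-based word and sentence segmentation is our assumption, and production readability tools tokenize more carefully.

```python
import re

def gulpease(text):
    """Gulpease readability index for Italian text (higher = easier).

    Formula: 89 + (300 * sentences - 10 * letters) / words.
    Naive segmentation: alphabetic runs as words, .!? runs as
    sentence boundaries (an assumption, not the paper's tooling).
    """
    words = re.findall(r"[A-Za-zÀ-ÿ]+", text)
    letters = sum(len(w) for w in words)
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    return 89 + (300 * sentences - 10 * letters) / len(words)
```

On the short sentence "La Polizia Locale aiuta i cittadini." (6 words, 30 letters, 1 sentence) the formula yields exactly 89.0.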
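The edit distance in Table 8 is expressed as a percentage, which suggests a length-normalized Levenshtein distance [43] between the original and simplified texts. The exact normalization used in the study is not specified in this excerpt, so the sketch below assumes character-level distance divided by the length of the longer string.

```python
def levenshtein(a, b):
    """Character-level Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def normalized_edit_distance(a, b):
    # Assumed normalization: distance over the longer length,
    # giving a value in [0, 1] (0 = identical strings).
    return levenshtein(a, b) / max(len(a), len(b), 1)
```

Under this reading, a higher percentage means the simplified text departs more from the original, which matches the paper's observation that LLM outputs tend to show larger edit distances than the human simplifications.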