<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Generating and Evaluating Multi-Level Text Simplification: A Case Study on Italian</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Michele Papucci</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giulia Venturi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Felice Dell'Orletta</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>ItaliaNLP Lab @ Institute for Computational Linguistics, National Research Council</institution>
          ,
          <addr-line>Pisa</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Pisa</institution>
          ,
          <addr-line>Pisa</addr-line>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
<p>Recent advances in Generative AI and Large Language Models (LLMs) have enabled the creation of highly realistic synthetic content, yet controlling model outputs remains a challenge. In this study, we explore the use of LLMs to generate high-quality synthetic data for Automatic Text Simplification (ATS), evaluating the ability of models fine-tuned on Italian to produce multiple simplified versions of the same original sentence that vary in readability and in their lexical and (morpho-)syntactic characteristics. The approach is tested across two domains, Wikipedia and Public Administration, allowing us to explore domain sensitivity. Additionally, we compare the linguistic phenomena observed in the generated data with those found in ATS resources previously created through manual or semi-automatic methods. Our results suggest that the best-performing LLM can generate linguistically diverse simplifications that align with known simplification patterns, offering a promising direction for building reliable ATS resources, including simplifications suited to varying levels of reader proficiency.</p>
      </abstract>
      <kwd-group>
        <kwd>Automatic Text Simplification</kwd>
        <kwd>Large Language Models</kwd>
        <kwd>Synthetic Data</kwd>
        <kwd>Linguistic Complexity</kwd>
        <kwd>Sentence Readability</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Automatic Text Simplification (ATS) aims to reduce the linguistic complexity of a text while preserving its meaning. Given that the dominant approach is data-driven, where models learn simplification operations from examples of complex-simple sentence pairs [<xref ref-type="bibr" rid="ref1">1</xref>], the availability and nature of resources for ATS play a crucial role in determining the quality of these models.</p>
      <p>Traditionally, manually constructed resources have been favored for their reliability and controllability [<xref ref-type="bibr" rid="ref2">2</xref>]. However, the cost and labor-intensiveness of such efforts limit their scalability, domain coverage, and language diversity. To address these limitations, researchers have explored unsupervised methods for resource construction, including mining sentence pairs from aligned corpora, primarily Wikipedia and Simple Wikipedia [<xref ref-type="bibr" rid="ref3">3</xref>], or exploiting crowdsourcing approaches [<xref ref-type="bibr" rid="ref4 ref5">4, 5</xref>]. In light of concerns about the suitability of Wikipedia as an ATS resource [<xref ref-type="bibr" rid="ref6">6</xref>], and to tackle the broader scarcity of parallel simplification data, especially for low-resource languages, researchers have also proposed methods to automatically create parallel resources, inspired for example by paraphrase generation [<xref ref-type="bibr" rid="ref7 ref8">7, 8</xref>] or machine translation [<xref ref-type="bibr" rid="ref10 ref9">9, 10</xref>].</p>
      <p>More recently, Large Language Models (LLMs) have introduced a new paradigm for ATS, also opening the possibility of generating synthetic resources whose quality still requires thorough assessment [<xref ref-type="bibr" rid="ref2">2</xref>]. This trend aligns with broader efforts to leverage LLMs for alleviating the limitations of real-world data through synthetic data generation [<xref ref-type="bibr" rid="ref11">11</xref>]. Evaluation initiatives such as BLESS [12] have demonstrated that LLMs, under a few-shot setting, are capable of generating simplified sentences across multiple datasets, languages, and prompts. Yet, research to date has primarily focused on English and has relied on a limited set of evaluation metrics, leaving open questions about model behavior across different domains, languages, and target user needs. Notable exceptions for the Italian language include [13] and [14], who assessed the ability of both open and proprietary LLMs to produce simplified sentences. The former focused on increased sentence readability, while the latter examined both readability and semantic similarity, comparing model-generated simplifications with those written by human simplifiers. Interestingly, both studies targeted the administrative domain.</p>
      <p>Starting from these premises, this paper introduces a multifaceted approach to assess the ability of three small LLMs fine-tuned on the Italian language to generate sentence simplifications along a gradient of complexity. After identifying the best-performing model, we examined its output along three main dimensions: i) its ability to produce multiple simplifications for the same input sentence with increasing levels of readability; ii) the extent to which the linguistic characteristics of the simplified sentences differ from those of the original; and iii) the relationship between the distribution of linguistic features and the readability level. This in-depth linguistic analysis of LLM-generated simplifications aims to achieve two main objectives. First, it investigates whether small, open LLMs can reliably produce multiple simplifications with varying degrees of linguistic complexity, thereby offering a scalable strategy for creating resources tailored to different target populations, which remain scarce [<xref ref-type="bibr" rid="ref2">2</xref>]. Second, it aims to explore whether specific linguistic patterns observed in original–simplified sentence pairs are influenced by the approach used to construct ATS resources, as discussed in [15].</p>
      <p>CLiC-it 2025: Eleventh Italian Conference on Computational Linguistics, September 24–26, 2025, Cagliari, Italy. https://michelepapucci.github.io/ (M. Papucci); http://www.italianlp.it/people/giulia-venturi/ (G. Venturi); http://www.italianlp.it/people/felice-dellorletta/ (F. Dell'Orletta). ORCID: 0000-0003-4251-7254 (M. Papucci); 0000-0001-5849-0979 (G. Venturi). © 2025 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).</p>
      <sec id="sec-1-2">
        <title>2. Methodology</title>
        <p>The approach we propose for assessing the ability of LLMs to automatically generate sentence simplifications along a gradient of linguistic complexity is articulated in three main steps: 1. selection of an LLM fine-tuned on the Italian language, capable of reliably generating sentences in the target language, and identification of a corpus of human-written sentences to be used as original inputs; 2. prompting the selected LLM to generate multiple simplified versions of each original sentence to obtain diverse outputs per input; 3. evaluation of the resulting sentence pairs in terms of their linguistic feature diversity and variation in readability levels.</p>
        <p>The main objective of the first two steps, described in Section 3, is to construct a parallel corpus composed of human-written original sentences and multiple automatically generated simplified versions. This allows for capturing a range of sentence transformations characterized by different linguistic phenomena. In this respect, the proposed methodology is particularly suitable for low-resource languages, where simplified corpora remain scarce, especially those addressing multiple reader profiles, domains, or textual genres.</p>
        <p>The evaluation of the generated simplifications, which constitutes the main focus of this study, is presented in Section 4. Our multifaceted evaluation methodology aims to assess not only how readability levels vary across the multiple simplifications and relative to the original sentence, but also how the lexical, morpho-syntactic, and syntactic characteristics of the sentence pairs change. A further contribution of this study lies in a comparative analysis designed to explore whether specific linguistic phenomena observed in the LLM-generated simplifications resemble those found in existing Italian ATS resources, specifically two created manually [16] and one semi-automatically [<xref ref-type="bibr" rid="ref7">7</xref>].</p>
      </sec>
      <sec id="sec-1-3">
        <title>3. Experimental Settings</title>
        <p>LLM selection. To identify the most suitable LLM for the task of generating simplified sentences, we considered three models specifically developed for the Italian language, which differ in terms of architecture and number of parameters: ANITA [17] (HuggingFace handle: swap-uniba/LLaMAntino-3-ANITA-8B-Inst-DPO-ITA), LLaMAntino-2 [18] (swap-uniba/LLaMAntino-2-7b-hf-dolly-ITA), and Italia (iGeniusAI/Italia-9B-Instruct-v0.1). All models were tested in a 0-shot setting. The models' performance was evaluated against the test splits of the following Italian sentence simplification datasets: 51 paired original/simplified sentences from SIMPITIKI [19] (from SIMPITIKI we took only the Wikipedia sentence pairs and excluded the Administrative domain ones, since those are the same sentences already present in ADMIN-it), 994 sentence pairs filtered from PaCCSS–IT [<xref ref-type="bibr" rid="ref7">7</xref>], 101 sentence pairs from the Terence corpus and 17 from the Teacher corpus [16], and 49 sentence pairs extracted from ADMIN-it [20], for a total of 1,212 sentence pairs.</p>
        <p>As evaluation metrics, we selected a set of complementary measures addressing different aspects of sentence simplification. Specifically, we included i) two metrics widely used in the literature that focus on surface-level properties related to writing style, i.e. BLEU [21] and SARI [22], and ii) two semantic similarity metrics used to assess meaning preservation, i.e. BertScore [23] and SentenceTransformer Similarity [24, 25]. In addition, we evaluated the simplified sentences in terms of variation in readability computed by READ-IT [26], the first machine-learning-based automatic readability assessment tool developed for Italian, combining traditional surface features with lexical, morpho-syntactic, and syntactic information correlated with linguistic complexity.</p>
        <p>All models were evaluated on a single generation for each input. Each model was prompted using its respective system prompt, combined with a shared task-specific instruction to simplify the text while preserving the original meaning (see Appendix A for more details). The results are reported in Table 1, where it should be noted that the evaluation metrics follow an increasing trend, meaning that higher scores correspond to more simplified sentences. In contrast, READ-IT scores exhibit the opposite trend: they range from 0 (most readable sentence) to 100 (least readable sentence), as they reflect the level of linguistic complexity of the input. Notably, LLaMAntino-2 consistently outperformed the other LLMs across all evaluation metrics, generating sentences that are simpler than the original inputs in both surface-level properties and semantic content. Moreover, its outputs had the lowest READ-IT scores, indicating that they are the least linguistically complex among those produced by the tested models. As a result, it was selected for the second step of our methodology.</p>
      </sec>
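      <p>Of the surface-level metrics named above, BLEU [21] scores n-gram overlap between a system output and a reference. Purely as an illustration (the function name and toy smoothing are ours, not the paper's; real evaluations use corpus-level toolkit implementations, and plain BLEU has no add-one smoothing), a sentence-level variant can be sketched as:</p>
      <preformat>
```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU sketch: geometric mean of clipped n-gram
    precisions times a brevity penalty. Add-one smoothing is applied to
    higher-order precisions so one missing n-gram does not zero the score."""
    cand, ref = candidate.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        cand_ngrams, ref_ngrams = ngrams(cand, n), ngrams(ref, n)
        overlap = sum(min(c, ref_ngrams[g]) for g, c in cand_ngrams.items())
        total = max(1, sum(cand_ngrams.values()))
        if n == 1:
            precisions.append(overlap / total)
        else:  # smoothed higher-order precision
            precisions.append((overlap + 1) / (total + 1))
    if min(precisions) == 0:
        return 0.0
    log_avg = sum(math.log(p) for p in precisions) / max_n
    # Brevity penalty punishes candidates shorter than the reference.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(1, len(cand)))
    return bp * math.exp(log_avg)
```
      </preformat>
      <p>On identical sentences the score is 1.0, while the brevity penalty lowers the score of an overly short candidate even when its low-order n-gram precision is perfect.</p>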
      <sec id="sec-1-1">
        <title>Table 1: model comparison (ANITA, LLaMAntino-2, Italia)</title>
        <p>Textual domains. We tested the full experimental setting on two corpora representative of two Italian language varieties that are widely acknowledged to exhibit significantly different linguistic features. Specifically, we selected a collection of sentences downloaded from Wikipedia pages, as it is the most frequently addressed domain in the literature on ATS [<xref ref-type="bibr" rid="ref2">2</xref>]. As a counterpart, we included the "PaWaC – Public Administration Web Corpus" (PaWaC [27]), which contains a wide range of administrative texts (resolutions, circular letters, etc.) and represents the Italian language used in public administration, a language variety well-known for its high level of multilevel linguistic complexity [28]. For both domains, we randomly sampled 10,000 sentences to serve as the original texts for generating multiple simplified variants.</p>
        <p>Generation of multiple simplifications. Step two of our methodology was performed by prompting LLaMAntino-2 with the same prompt introduced previously to generate multiple simplified versions for the collection of the original 10,000 sentences for the Wikipedia and administrative domains. To this end, we employed the Diverse Beam Search decoding technique [29] to obtain multiple simplifications for each original sentence. Through manual inspection of the outputs generated under different decoding settings, we found that using 20 beams divided into 10 groups, with a diversity penalty of 0.7, provided the best results in terms of diversity of the simplifications and text fluency.</p>
        <p>Using this decoding strategy, we obtained 10 simplifications for each original sentence. The resulting resource was automatically revised by removing duplicate simplifications and cases where the original and simplified sentences were identical. After this clean-up, we obtained 71,837 original/simplified sentence pairs for Wikipedia and 78,184 pairs for PaWaC.</p>
        <p>Table 2 reports two examples randomly extracted from the generated resource. Concerning the administrative domain, we can see that the least simplified PaWaC sentences (i.e. those with the higher READ-IT scores) are simplified primarily through the deletion of informational content (e.g. non automaticamente rinnovabili 'not automatically renewable' is removed). In contrast, the most simplified sentences display linguistic features typically associated with more readable sentence structures while keeping the original information content. For instance, the simplest sentence (i.e. the sentence with the lowest READ-IT score) is characterized by a reduced distance between the nominal subject (le concessioni 'the concessions') and the main verb (devono essere considerate 'must be considered'). In addition, the main verb undergoes i) a lexical simplification, since the simpler considerare 'to consider' replaces the more complex original verb intendersi 'to understand', and ii) a morphological simplification, since the epistemic future is replaced by a more straightforward present-tense form. Also in the case of the Wikipedia example, the most simplified sentences are the result of structural transformations. Namely, the two versions with the lowest READ-IT scores contain the main verb in the active voice instead of the passive, and feature shorter syntactic dependency links among words.</p>
        <p>Linguistic profiling. Our evaluation step includes a comparative analysis of the distribution of multilevel linguistic features automatically extracted from the original and the LLaMAntino-2–generated simplified sentences. To this end, we adopted Profiling-UD [30], a web-based tool designed to linguistically profile multilingual texts using the Universal Dependencies (UD) formalism [31]. The full set of features is detailed in Table 3. They can be grouped into nine categories, each corresponding to linguistic phenomena possibly related to sentence complexity. These range from raw text properties, such as sentence length, to more complex features, such as the distribution of UD Parts-of-Speech (POS), dependency relations, and verbal inflectional properties (e.g. mood, tense). Additional features capture global and local syntactic structure, such as the structure of verbal predicates, the order of nuclear sentence elements (subject and object) relative to the verb, and the use of subordination. The set also includes features modeling the lexical variety of sentences, specifically in terms of the i) distribution of word forms and lemmas belonging to the New Basic Italian Vocabulary (NBIV) [32], further classified into the three usage categories, and ii) distribution of lemmas based on classes of frequency computed using the Italian section of the MC4 corpus [33, 34], according to a logarithmic frequency-class function. (For the list of UD POS refer to https://universaldependencies.org/u/pos/index.html, for syntactic relations to https://universaldependencies.org/u/dep/index.html, and for verbal morphological tags to https://universaldependencies.org/u/feat/index.html.)</p>
        <sec id="sec-1-1-4">
          <title>PaWaC</title>
          <p>Stabilito pertanto che le concessioni rilasciate in base al presente bando dovranno
intendersi come decennali e non automaticamente rinnovabili (It is therefore
established that the concessions granted under this call shall be understood decennial and not
automatically renewable)
Stabilito pertanto che le concessioni rilasciate in base al presente bando dovranno
intendersi come decennali. (It is therefore established that the concessions granted under
this call shall be understood as decennial.)
Stabilito pertanto che le concessioni rilasciate in base al presente bando dovranno
intendersi come decennali e non rinnovabili. (It is therefore established that the
concessions granted under this call shall be understood as decennial and not renewable.)
Il presente bando stabilisce che le concessioni devono essere considerate decennali e
non automaticamente rinnovabili. (This call establishes that the concessions must be
considered decennial and not automatically renewable.)
Il presente bando stabilisce che le concessioni devono essere considerate decennali
e non rinnovabili automaticamente (This call establishes that the concessions must be
considered decennial and not renewable automatically.)
Quindi le concessioni basate sul presente bando devono essere considerate decennali
e non automaticamente rinnovabili. (Therefore the concessions based on this call must
be considered decennial and not automatically renewable)</p>
        </sec>
        <sec id="sec-1-1-5">
          <title>Wikipedia</title>
          <p>Alcuni composti aromatici più pesanti, come lo xilene, possono essere utilizzati al
posto del toluene ottenendo rese comparabili. (Some heavier aromatic compounds, such
as xylene, can be used instead of toluene obtaining comparable yields.)
Alcuni composti aromatici più pesanti possono essere utilizzati al posto del toluene
ottenendo rese comparabili. (Some heavier aromatic compounds can be used instead of
toluene obtaining comparable yields.)
La maggior parte degli aromi più pesanti possono essere utilizzati al posto di toluene
ottenendo rese comparabili. (The majority of heavier aromatics can be used in place of
toluene obtaining comparable results.)
La maggior parte degli aromi più pesanti possono essere utilizzati al posto di toluene.
(The majority of heavier aromatics can be used in place of toluene.)
È possibile utilizzare xilene invece di toluene per ottenere un prodotto finale simile. (It
is possible to use xylene instead of toluene to obtain a similar end product.)
È possibile utilizzare xilene invece di toluene per ottenere una resa simile. (It is possible
to use xylene instead of toluene to obtain a comparable yield.)
The frequency class of each lemma CL is computed as C = ⌊log2(freq(MFL) / freq(CL))⌋, where MFL is the most frequent lemma in the corpus and CL is the considered lemma.</p>
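          <p>The frequency-class function can be sketched in a few lines (our own toy illustration; the paper computes frequencies over the Italian section of the MC4 corpus, replaced here by a tiny lemma list):</p>
          <preformat>
```python
import math
from collections import Counter

def frequency_classes(lemmas):
    """Assign each lemma the class C = floor(log2(freq(MFL) / freq(CL))),
    where MFL is the most frequent lemma in the corpus: class 0 for the
    most frequent lemma, higher classes for progressively rarer ones."""
    freq = Counter(lemmas)
    max_freq = max(freq.values())
    return {lemma: int(math.floor(math.log2(max_freq / f))) for lemma, f in freq.items()}
```
          </preformat>
          <p>For instance, in a corpus where the most frequent lemma occurs 8 times, a lemma occurring twice falls in class 2 and a lemma occurring once in class 3.</p>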
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>4. Linguistic Analysis of Simplified Sentences</title>
      <p>The evaluation of the LLaMAntino-2–generated simplified sentences was conducted both in terms of readability scores (see Section 4.1) and linguistic profiles (see Section 4.2) in comparison to their corresponding original sentences. In addition, we investigated whether there is a relationship between the changes in linguistic features and the variation in readability levels across original/simplified sentence pairs, with the aim of identifying which linguistic phenomena are most associated with variation in linguistic complexity (see Section 4.3). All evaluations were conducted considering a randomly sampled subset of 2,000 paired original/simplified sentences for each domain (the dataset is freely available at https://github.com/michelepapucci/multilevel-text-simplification-italian). Finally, Section 4.4 presents the results of a comparative analysis designed to examine whether different approaches to the construction of ATS resources influence the linguistic characteristics of simplified texts.</p>
      <sec id="sec-4-1">
        <title>4.1. Sentence Readability</title>
        <p>The first evaluation step was conducted by considering, for each original sentence, three representative cases among the multiple automatically generated simplifications: the Most simplified sentence, i.e. the one with the lowest READ-IT score, the Least simplified sentence, with the highest score, and a Randomly-selected simplification, selected from the remaining simplifications. The comparison was computed adopting Kernel Density Estimation (KDE), a probability distribution estimate obtained by smoothing out the READ-IT data points to create a continuous curve. Results are reported in Figure 1, where we can see that for both domains, all three types of simplifications exhibit a higher frequency of data points with lower READ-IT scores, confirming that the simplified sentences are generally easier to read. However, the shape of the distributions indicates that readability improvements vary depending on the source domain. Specifically, Wikipedia original sentences show a more uniform distribution across READ-IT scores, while PaWaC sentences are more concentrated at the higher end of the readability spectrum. This indicates that the simplified sentences in the administrative corpus remain less accessible than Wikipedia simplified sentences, reflecting the intrinsically higher linguistic complexity of administrative texts. Looking at the multiple simplifications, the Most simplified sentences exhibit a strongly left-skewed distribution in both domains, indicating that at least one version per original achieves significantly lower READ-IT scores. For the Randomly-selected simplifications, the KDE curve for Wikipedia shows a marked shift toward lower scores, suggesting that model-generated simplifications are generally simpler than their originals. A similar trend is observed for the PaWaC domain, although the distribution is flatter and less uniform, indicating greater variability across the simplified outputs.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Linguistic Features</title>
        <p>The linguistic profile–based evaluation is twofold. The first level focuses on analyzing the differences between each of the three types of generated simplifications and their corresponding original sentence, in terms of linguistic profile. To this end, we applied a Multivariate Analysis of Variance (MANOVA), which, unlike traditional ANOVA that considers only a single dependent variable, evaluates whether the mean vectors</p>
      </sec>
      <sec id="sec-3-1">
        <title>Original vs. Least Simplified, Original vs. Randomly-Selected, Original vs. Most Simplified</title>
        <p>of multiple dependent variables differ significantly between groups, making it well-suited to our multi-feature linguistic profiling. To quantify the degree of difference in each comparison, we report Pillai's Trace, one of the statistics derived from MANOVA. Pillai's Trace is particularly robust, especially in situations where assumptions like homogeneity of covariance matrices may be violated.</p>
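        <p>Pillai's Trace is computed from the between-group (H) and within-group (E) sums-of-squares-and-cross-products matrices as trace(H(H+E)^-1). The dependency-free sketch below (our own illustration for exactly two dependent variables; a real analysis would rely on a statistics package such as statsmodels) makes the definition concrete:</p>
        <preformat>
```python
def mean_vec(rows):
    """Component-wise mean of a list of equal-length observation vectors."""
    n = len(rows)
    return [sum(r[i] for r in rows) / n for i in range(len(rows[0]))]

def pillai_trace_2d(groups):
    """Pillai's Trace = trace(H (H + E)^-1) for two dependent variables.
    groups: list of groups, each a list of [x, y] observations."""
    all_rows = [r for g in groups for r in g]
    gm = mean_vec(all_rows)
    H = [[0.0, 0.0], [0.0, 0.0]]  # between-group SSCP
    E = [[0.0, 0.0], [0.0, 0.0]]  # within-group SSCP
    for g in groups:
        m = mean_vec(g)
        for i in range(2):
            for j in range(2):
                H[i][j] += len(g) * (m[i] - gm[i]) * (m[j] - gm[j])
                E[i][j] += sum((r[i] - m[i]) * (r[j] - m[j]) for r in g)
    T = [[H[i][j] + E[i][j] for j in range(2)] for i in range(2)]
    det = T[0][0] * T[1][1] - T[0][1] * T[1][0]
    Tinv = [[T[1][1] / det, -T[0][1] / det], [-T[1][0] / det, T[0][0] / det]]
    # trace of the matrix product H @ Tinv
    return sum(H[i][k] * Tinv[k][i] for i in range(2) for k in range(2))
```
        </preformat>
        <p>Identical group means give H = 0 and hence a trace of 0, while well-separated groups push the statistic toward its upper bound (1 for two groups), matching the interpretation that higher values mean greater multivariate differences.</p>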
        <p>Higher values of Pillai's Trace indicate greater multivariate differences between groups.</p>
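        <p>The feature-based analysis reported later in this section pairs the Wilcoxon signed-rank test with a rank-biserial effect size; by the simple difference formula, r = (T+ - T-) / (T+ + T-), where T+ and T- are the rank sums of positive and negative paired differences. A minimal sketch (our own helper, not the paper's code; zero differences are discarded and ties receive average ranks, while the significance test itself would use scipy.stats.wilcoxon):</p>
        <preformat>
```python
def rank_biserial(original, simplified):
    """Matched-pairs rank-biserial correlation: +1 when the feature value
    is always higher in the original sentence, -1 when always higher in
    the simplified one. Uses r = (T_plus - T_minus) / (T_plus + T_minus)."""
    diffs = [o - s for o, s in zip(original, simplified) if o != s]  # drop zeros
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    i = 0
    while i != len(order):  # assign average ranks to ties in |diff|
        j = i
        while j != len(order) and abs(diffs[order[j]]) == abs(diffs[order[i]]):
            j += 1
        for k in range(i, j):
            ranks[order[k]] = (i + 1 + j) / 2.0  # mean of 1-based ranks i+1..j
        i = j
    t_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    t_total = sum(ranks)
    # T_plus - T_minus equals 2 * T_plus - (T_plus + T_minus).
    return (2.0 * t_plus - t_total) / t_total
```
        </preformat>
        <p>For example, four sentence pairs where the original value exceeds the simplified one in three cases (tied differences) and falls below it once yield r = 0.8, a large effect in the direction of reduction under simplification.</p>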
        <p>The results, summarized in Table 4, show that all comparisons yield statistically significant differences (p ≤ 10−4) in both domains. Among the three sets, the Least Simplified sentences consistently yield the smallest Pillai's Trace values (.12 for Wikipedia and .16 for PaWaC), indicating the greatest similarity to the original sentences. In contrast, the Most Simplified sentences show the highest values (.44 and .46), indicating that the simplification process led to substantial transformations in their linguistic profiles. The Randomly-Selected simplifications fall in between, though they are closer to the least simplified set, indicating that they retain a considerable degree of the original sentences' linguistic characteristics. This aligns with the trend observed in Figure 1, where the KDE curve for the Randomly-Selected simplifications peaks at lower READ-IT scores, similar to the most simplified set, but also shows a broader tail, indicating that some of these sentences remain close in readability to the originals. This trend is shared across domains, even with some differences that highlight domain-specific characteristics of the simplification process.</p>
        <p>Notably, we generally observe slightly higher Pillai's Trace values for the PaWaC dataset. This suggests that, although simplified sentences in the administrative domain tend to have higher READ-IT scores than those from Wikipedia, the MANOVA results indicate that their generation involves more substantial transformations, possibly affecting multiple linguistic features, pointing to more articulated simplification processes in this domain. Consequently, even the Least Simplified PaWaC sentences display a more distinct linguistic profile compared to their originals.</p>
        <p>Feature-based Analysis. This second level of analysis focuses on the set of Randomly-selected Simplifications, which serve as representative examples of typical simplifications, as they were randomly selected from the pool excluding the extremes. Specifically, we applied the Wilcoxon signed-rank test (with p &lt; 0.05) to compare the distribution of each feature between the original sentence and its corresponding simplification. In addition, to quantify the strength of the observed differences, we computed their rank-biserial correlation score r [35], which ranges between +1 (when the value of the feature occurring in the original sentence is higher than in the simplified sentence) and −1 (in the opposite case). By capturing the effect size of the Wilcoxon test, the r score reflects the magnitude of statistically significant distributional differences. Tables 5 and 6 show features with |r| ≥ 0.4 and their mean and standard deviation for the Wikipedia and PaWaC domains (the full list of features is reported in Appendix C).</p>
        <p>Quite interestingly, a subset of the reported features is shared across the two domains. This suggests that these features correspond to linguistic phenomena highly related to sentence complexity, regardless of the textual domain, and are typically modified to improve sentence readability. As expected, among these features we find sentence length (sent_len), which displays the highest r score in Wikipedia and the second highest in PaWaC. However, by inspecting the differences across domains, we observe that administrative sentences are particularly shortened compared to their originals. Since the majority of the features considered are closely tied to sentence length, this outcome may impact the distribution of the other most varying features.</p>
        <p>Nevertheless, we can see that several features modeling different syntactic properties of sentences are highly ranked in terms of r score for both domains. One such feature is the distribution of verbal heads (verbal_head), i.e. tokens POS-tagged as verbs that function as the syntactic head in dependency relations, which is notably reduced in the simplified sentences. This reduction is closely linked to the decreased use of subordination, as indicated by lower values of a set of related features capturing this phenomenon. The set includes: the overall distribution of subordinate clauses (subord_prop), their position relative to the principal clause (subord_post), and their organization into sequences of embedded subordinate clauses (avg_Schain_len). Among these, we can also include a feature from the verb inflectional morphology group that is closely related to reduced subordination: the lower distribution of subjunctives (aux_Sub). Additionally, features modeling both global and local aspects of syntactic tree structure vary significantly in both domains. These include syntactic tree depth (tree_depth), indicative of sentence complexity [36], as well as two features associated with long-distance dependencies, well-known sources of cognitive load [37, 38]: the length of the longest dependency link (links_len_max) and the number of embedded sequences of prepositional complements (n_prep_chains). A similar pattern is observed in the lower frequency of subjects and objects in non-canonical position occurring in simplified sentences, specifically pre-verbal objects (obj_pre) and post-verbal subjects (subj_post), both known to be harder to process. On the lexical side, simplified sentences in both domains exhibit a reduced proportion of lemmas from the highest frequency class (highest_class). Interestingly, both domains display negative r scores for the distribution of auxiliary verbs (upos_AUX and dep_aux), indicating an increase in auxiliary usage in simplified versions. An in-depth analysis of verb forms reveals that this may reflect a higher prevalence of 'passato prossimo' tenses (roughly present perfect tenses) and a corresponding reduction of 'passato remoto' (roughly simple pasts), particularly in Wikipedia.</p>
        <p>When focusing on features that vary significantly and with |r| ≥ 0.4 in only one domain, we find that they capture finer-grained phenomena. They predominantly involve the distribution of specific verb tenses, such as present tense forms (*_Pres) in Wikipedia (whereas in PaWaC they show only |r| = 0.15), and future (*_Fut) and imperfect (*_Imp) tenses in PaWaC (but not significantly varying in Wikipedia). A similar trend is observed for specific verb moods such as participles (*_Part), which vary above our threshold only in Wikipedia, and conditionals (*_Cond), varying significantly in PaWaC.</p>
        <p>4.3. Linguistic Features and Readability</p>
        <p>As a third level of analysis, we investigated which linguistic phenomena characterize automatically simplified sentences in relation to the differences in readability between the original and simplified versions. To this end, considering the Randomly-selected simplification, we computed Spearman correlations between the differences in the distribution of the linguistic features, extracted using Profiling-UD, and the corresponding differences in their READ-IT scores. The results are reported in Appendix B, where we compare the correlation scores for</p>
        <p>to exhibit a relatively high level of linguistic complexity even after simplification (see Figure 1). It is therefore plausible that a surface-level transformation such as reducing sentence length is less predictive of changes in readability scores in this domain. This interpretation is also consistent with the MANOVA results, which indicate that simplified PaWaC sentences differ more substantially from their original versions across multiple linguistic features, suggesting a more articulated simplification process.</p>
        <p>Among the top-ranked correlated features, we find several that, while sensitive to sentence length, also reflect deeper, linguistically motivated transformations involved in the simplification process. This is the case of the distribution of verbal heads (verbal_head_per_sent) and of a subset of related features modeling the subordination. These include: the overall distribution of subordinate clauses (subordinate_proposition_dist); their organization in recursively embedded subordinate clause chains within a top-level subordinate clause
the Wikipedia and PaWac domains. We focus on the set (avg_subordinate_chain_len_dif ); their relative order
of linguistic features that show statistically significant with respect to the principal clause (subordinate_post), a
correlations (i.e.  &lt; 0.05). characteristic associated with diferences in cognitive
pro</p>
        <p>As can be seen, most of the correlation scores are pos- cessing dificulty [ 39]; and a specific type of subordinate
itive. This suggests that an increase in the diference clauses, i.e. relative clauses (dep_dist_acl:relcl), which are
of specific linguistic features between original and sim- well-known sources of processing dificulty. In addition,
plified sentences is often directly proportional to the we find two features related to long-distance
construcincrease in their readability diference. This is the case, tions: the length of the longest dependency link in a
for example, for the distribution of subordinate clauses sentence (max_links_len) and the number of embedded
(subordinate_proposition) in both domains, which tend sequences of prepositional complements governed by a
to be significantly reduced in the simplified sentences, nominal head (n_prepositional_chains).
leading to lower syntactic complexity and, consequently, Focusing on lexical variation, the reduction in the
proa lower READ-IT score. By contrast, the diference in the portion of lemmas belonging to the highest frequency
distribution of auxiliary verbs (upos_dist_AUX ) shows class (highest_class) shows a positive correlation with
a negative correlation with the diference in READ-IT readability improvement, particularly in PaWac ( =
scores for both domains, as the distribution of auxiliaries 0.20) compared to Wikipedia ( = 0.16). Conversely,
increases in the simplified sentences. a slight increase in the use of ‘high availability words’
Cross-Domain Correlation Patterns. When ranking (lower-frequency lemmas referring to everyday objects
the linguistic features in decreasing order of correlation, or actions and well known to speakers), as identified in
we observe that the most strongly correlated features the NBIV (in_AD_types), is negatively correlated in both
are shared across both domains, despite diferences in domains.
correlation scores. Notably, many of the top-ranked ones
correspond to those discussed in the previous section. 4.4. Comparing Simplification
This seems to support the hypothesis that the linguistic
phenomena mostly involved in the transformations of Approaches
original sentences are also those that have the greatest
impact on sentence readability.</p>
        <p>
          As expected, the most strongly correlated feature is
sentence length (tokens_per_sent), which is considerably
reduced in the simplified sentences. Interestingly, even
if this pattern holds across both domains, the
correlation is stronger for Wikipedia ( = 0.51) than for PaWac
( = 0.42). This seems to align with and complement the
intuition that simplifying administrative texts is
particularly challenging, as many of the PaWac sentences tend
We complemented the linguistic profiling of the
LLaMAntino-2–generated simplified sentences with a
comparative analysis aimed at identifying whether
certain linguistic phenomena are specific to the LLM-based
approach to ATS resource construction or are shared
across diferent simplification methodologies. To this
end, we started from the findings of [ 15], who compared
two Italian ATS resources created manually, “Teacher”
and “Terence” [16], and one semi-automatically,
PaCCSSIT [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ], focusing on the distribution of a set of linguistic
features comparable to those used in the present study. [15]. This aligns with observations about the insertion
Our main goal is to assess whether some linguistic fea- of explicit arguments to reduce the inference load
associtures are characteristic of simplified sentences regardless ated with null-subject constructions [40]. Interestingly,
of the simplification method adopted. While prelimi- however, the tendency to favor the canonical Italian
arnary, our results provide initial insights into whether an gument order, with subjects preceding the verb and
obLLM-based method yields simplified sentences with char- jects following it, is not consistently observed across
acteristics similar to those produced by human experts. resources. While unmarked word orders are generally
        </p>
        <p>The first characteristic shared by sentences simplified preferred in simplification, as they are known to ease
by both human experts and automatically generated con- processing in free word-order languages [41], a higher
cerns their sentence length. Simplified sentences are proportion of pre-verbal subjects is found only in the
always shorter than their original counterparts. This PaWac LLaMAntino-2-generated simplifications and in
could be expected since sentence length has been con- the Teacher corpus. An even less consistent pattern
sidered as a shallow proxy of sentence complexity and emerges for post-verbal objects, whose distribution
difis widely used by traditional readability assessment for- fers across original and simplified sentences without a
mulas. However, the diferent average length in original- systematic direction.
simplified sentence pairs may difer according to textual
genre, as shown in our analysis and discussed in [15].</p>
        <p>A second group of features common to all ATS re- 5. Conclusion
sources includes those modeling the morpho-syntactic
profile of the simplified sentences 9. Similarly to manu- This study investigated the ability of small LLMs
fineally and semi-automatically built simplifications, the sen- tuned on the Italian language to generate sentence
simtences automatically generated by LLaMAntino-2 tend plifications in a zero-shot setting, focusing on two
linto contain fewer pronouns, adverbs, and punctuation guistically distinct domains: Wikipedia and Public
Admarks, and a higher proportion of determiners. However, ministration. All tested models were able to produce
simin contrast to the findings reported in [ 15], which were plified sentences that preserved the surface-level
propalso based on the Wilcoxon signed-rank test ( &lt; 0.05), erties and semantic content of the original inputs while
the LLM-generated simplified sentences exhibit a higher improving readability. Among them, LLaMAntino-2
confrequency of nouns, and the variation in the distribution sistently outperformed the other models across all
evalof adjectives compared to the original sentences is not uation metrics. Beyond single-sentence simplification,
statistically significant. We leave to future work the in- we also showed that prompting the model to generate
vestigation of whether this trend may be influenced by multiple outputs for the same input sentence results in a
the textual genre of the original sentences. meaningful gradient of linguistic complexity.</p>
        <p>Among the features common across approaches, we Domain-specific analyses revealed that, although
simifnd those capturing global and local syntactic structure. plified sentences in the administrative domain remain
As also observed in Section 4.2, simplified sentences tend less accessible than their Wikipedia counterparts,
simto have shallower syntactic trees and shorter dependency plifying administrative texts involves more substantial
links, suggesting that reducing syntactic depth and de- linguistic transformations, as suggested by MANOVA
pendency length is a broadly adopted simplification strat- results, thus pointing to more complex simplification
egy. However, when examining finer-grained syntactic strategies in this domain. These findings highlight the
properties, some diferences emerge. A first example potential of this approach to support the development
concerns the use of subordination. While previous stud- of ATS resources tailored to specific reader profiles and
ies suggest that subordinate clauses following the main domains. Despite a few cross-domain diferences, our
clause are easier to process [39], only the “Terence” cor- analysis of the linguistic features most afected by
simpus and PaCCSS-IT show a higher percentage of post- plification shows that many transformations are shared
verbal subordinates. By contrast, an opposite trend is across domains and closely align with known
simplificaobserved in the sentences automatically generated by tion patterns found in manually constructed ATS corpora.
LLaMAntino-2 as well as in the manually built “Teacher” These findings support two key directions for future
corpus, where post-verbal subordinates are less frequent. work. First, the generation of synthetic simplifications
A second example is the distribution of subjects. All re- using small, language-specific LLMs ofers a promising
sources show an increased presence of overt subjects in method for building ATS resources in low-resource
setsimplified sentences, particularly in the “Teacher” cor- tings. Second, the linguistic properties characterizing
pus, representing an intuitive manual simplification in LLM-generated simplifications can inform Controllable
Text Generation approaches [42], enabling models to be
guided toward specific simplification strategies aligned
with the needs of diferent reader populations.
9The values of some linguistic features are not reported in Tables 6
and 5, as their rank-biserial correlation scores are || ≤ 0.4.</p>
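<p>The per-feature comparison used throughout this section (Wilcoxon signed-rank test, matched-pairs rank-biserial effect size, and Spearman correlation with READ-IT score differences) can be sketched as follows. This is an illustrative SciPy-based reimplementation, not the authors’ code; the feature values, variable names, and the helper rank_biserial are placeholders of our own choosing.</p>

```python
# Illustrative sketch (not the authors' code) of the per-feature analysis:
# Wilcoxon signed-rank test plus a matched-pairs rank-biserial effect size.
import numpy as np
from scipy.stats import wilcoxon, rankdata, spearmanr

def rank_biserial(original, simplified):
    """Matched-pairs rank-biserial r: +1 when the feature is always higher
    in the original sentence, -1 in the opposite case."""
    d = np.asarray(original, float) - np.asarray(simplified, float)
    d = d[d != 0]                    # the Wilcoxon test drops zero differences
    ranks = rankdata(np.abs(d))      # rank the absolute differences
    w_plus = ranks[d > 0].sum()      # rank mass where the original is higher
    return (2.0 * w_plus - ranks.sum()) / ranks.sum()

# Values of one linguistic feature in original vs. simplified sentences
# (placeholder numbers, one pair per sentence).
orig = [3.0, 5.0, 4.0, 6.0, 8.0]
simp = [1.0, 2.0, 2.0, 7.0, 3.0]

stat, p = wilcoxon(orig, simp)       # significance of the distributional shift
r = rank_biserial(orig, simp)        # effect size; reported when |r| is large

# Section 4.3 then correlates per-sentence feature differences with
# READ-IT score differences via Spearman's rho (placeholder scores below).
rho, p_rho = spearmanr(np.subtract(orig, simp), [0.4, 0.6, 0.3, -0.1, 0.9])
```

<p>Features whose Wilcoxon p-value falls below the significance threshold and whose |r| exceeds the chosen cut-off would then be retained for the tables, mirroring the selection criteria described above.</p>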
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Acknowledgments</title>
      <p>This work has been supported by the project “XAI-CARE”
funded by the European Union - Next Generation EU
NRRP M6C2 “Investment 2.1 Enhancement and
strengthening of biomedical research in the NHS”
(PNRR-MAD2022-12376692_VADALA’ – CUP F83C22002470001) and
by the PRIN 2022 project TEAMING-UP - Teaming up
with Social Artificial Agents (20177FX2A7) funded by the
Italian Ministry of University and Research.</p>
<p>18653/v1/2024.findings-acl.658.</p>
      <p>[12] T. Kew, A. Chi, L. Vásquez-Rodríguez, S. Agrawal, D. Aumiller, F. Alva-Manchego, M. Shardlow, BLESS: Benchmarking large language models on sentence simplification, in: H. Bouamor, J. Pino, K. Bali (Eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Singapore, 2023, pp. 13291–13309. URL: https://aclanthology.org/2023.emnlp-main.821/. doi:10.18653/v1/2023.emnlp-main.821.</p>
      <p>[13] D. Nozza, G. Attanasio, Is it really that simple? prompting large language models for automatic text simplification in Italian, in: F. Boschetti, G. E. Lebani, B. Magnini, N. Novielli (Eds.), Proceedings of the 9th Italian Conference on Computational Linguistics (CLiC-it 2023), CEUR Workshop Proceedings, Venice, Italy, 2023, pp. 322–333. URL: https://aclanthology.org/2023.clicit-1.39/.</p>
      <p>[14] M. Russodivito, V. Ganfi, G. Fiorentino, R. Oliveto, AI vs. human: Effectiveness of LLMs in simplifying Italian administrative documents, in: F. Dell’Orletta, A. Lenci, S. Montemagni, R. Sprugnoli (Eds.), Proceedings of the 10th Italian Conference on Computational Linguistics (CLiC-it 2024), CEUR Workshop Proceedings, Pisa, Italy, 2024, pp. 842–853. URL: https://aclanthology.org/2024.clicit-1.91/.</p>
      <p>[15] D. Brunato, F. Dell’Orletta, G. Venturi, Linguistically-based comparison of different approaches to building corpora for text simplification: A case study on Italian, Frontiers in Psychology 13 (2022). URL: https://www.frontiersin.org/journals/psychology/articles/10.3389/fpsyg.2022.707630. doi:10.3389/fpsyg.2022.707630.</p>
      <p>[16] D. Brunato, F. Dell’Orletta, G. Venturi, S. Montemagni, Design and annotation of the first Italian corpus for text simplification, in: A. Meyers, I. Rehbein, H. Zinsmeister (Eds.), Proceedings of the 9th Linguistic Annotation Workshop, Association for Computational Linguistics, Denver, Colorado, USA, 2015, pp. 31–41. URL: https://aclanthology.org/W15-1604/. doi:10.3115/v1/W15-1604.</p>
      <p>[17] M. Polignano, P. Basile, G. Semeraro, Advanced natural-based interaction for the Italian language: Llamantino-3-anita, 2024. arXiv:2405.07101.</p>
      <p>[18] P. Basile, E. Musacchio, M. Polignano, L. Siciliani, G. Fiameni, G. Semeraro, Llamantino: Llama 2 models for effective text generation in Italian language, 2023. arXiv:2312.09993.</p>
      <p>[19] S. Tonelli, A. P. Aprosio, F. Saltori, Simpitiki: a simplification corpus for Italian, Proceedings of CLiC-it (2016).</p>
      <p>[20] M. Miliani, S. Auriemma, F. Alva-Manchego, A. Lenci, Neural readability pairwise ranking for sentences in Italian administrative language, in: Y. He, H. Ji, S. Li, Y. Liu, C.-H. Chang (Eds.), Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Association for Computational Linguistics, Online only, 2022, pp. 849–866. URL: https://aclanthology.org/2022.aacl-main.63/. doi:10.18653/v1/2022.aacl-main.63.</p>
      <p>[21] K. Papineni, S. Roukos, T. Ward, W.-J. Zhu, Bleu: a method for automatic evaluation of machine translation, in: P. Isabelle, E. Charniak, D. Lin (Eds.), Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Philadelphia, Pennsylvania, USA, 2002, pp. 311–318. URL: https://aclanthology.org/P02-1040/. doi:10.3115/1073083.1073135.</p>
      <p>[22] W. Xu, C. Napoles, E. Pavlick, Q. Chen, C. Callison-Burch, Optimizing statistical machine translation for text simplification, Transactions of the Association for Computational Linguistics 4 (2016) 401–415. URL: https://aclanthology.org/Q16-1029/. doi:10.1162/tacl_a_00107.</p>
      <p>[23] T. Zhang, V. Kishore, F. Wu, K. Q. Weinberger, Y. Artzi, Bertscore: Evaluating text generation with bert, in: International Conference on Learning Representations, 2020. URL: https://openreview.net/forum?id=SkeHuCVFDr.</p>
      <p>[24] N. Reimers, I. Gurevych, Sentence-bert: Sentence embeddings using siamese bert-networks, in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, 2019. URL: https://arxiv.org/abs/1908.10084.</p>
      <p>[25] N. Reimers, I. Gurevych, Making monolingual sentence embeddings multilingual using knowledge distillation, in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, 2020. URL: https://arxiv.org/abs/2004.09813.</p>
      <p>[26] F. Dell’Orletta, S. Montemagni, G. Venturi, READ-IT: Assessing readability of Italian texts with a view to text simplification, in: N. Alm (Ed.), Proceedings of the Second Workshop on Speech and Language Processing for Assistive Technologies, Association for Computational Linguistics, Edinburgh, Scotland, UK, 2011, pp. 73–83. URL: https://aclanthology.org/W11-2308/.</p>
      <p>[27] L. C. Passaro, A. Lenci, PaWaC - Public Administration Web as Corpus (Processed), http://data.europa.eu/88u/dataset/elrc_1282, 2019. [Data set].</p>
      <p>[28] M. Cortelazzo, Il linguaggio amministrativo: principi e pratiche di modernizzazione, Carocci, 2021.</p>
      <p>[29] A. K. Vijayakumar, M. Cogswell, R. R. Selvaraju, Q. Sun, S. Lee, D. J. Crandall, D. Batra, Diverse beam search: Decoding diverse solutions from neural sequence models, CoRR abs/1610.02424 (2016). URL: http://arxiv.org/abs/1610.02424. arXiv:1610.02424.</p>
      <p>[30] D. Brunato, A. Cimino, F. Dell’Orletta, G. Venturi, S. Montemagni, Profiling-UD: a tool for linguistic profiling of texts, in: N. Calzolari, F. Béchet, P. Blache, K. Choukri, C. Cieri, T. Declerck, S. Goggi, H. Isahara, B. Maegaard, J. Mariani, H. Mazo, A. Moreno, J. Odijk, S. Piperidis (Eds.), Proceedings of the Twelfth Language Resources and Evaluation Conference, European Language Resources Association, Marseille, France, 2020, pp. 7145–7151. URL: https://aclanthology.org/2020.lrec-1.883/.</p>
      <p>[31] M.-C. De Marneffe, C. D. Manning, J. Nivre, D. Zeman, Universal dependencies, Computational Linguistics 47 (2021) 255–308.</p>
      <p>[32] T. De Mauro, I. Chiari, Il nuovo vocabolario di base della lingua italiana, Internazionale [accessed on 03/03/2023] (2016). URL: https://www.internazionale.it/opinione/tullio-de-mauro/2016/12/23/il-nuovo-vocabolario-di-base-della-lingua-italiana.</p>
      <p>[33] G. Sarti, M. Nissim, IT5: Text-to-text pretraining for Italian language understanding and generation, in: N. Calzolari, M.-Y. Kan, V. Hoste, A. Lenci, S. Sakti, N. Xue (Eds.), Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), ELRA and ICCL, Torino, Italy, 2024, pp. 9422–9433. URL: https://aclanthology.org/2024.lrec-main.823.</p>
      <p>[34] L. Xue, N. Constant, A. Roberts, M. Kale, R. Al-Rfou, A. Siddhant, A. Barua, C. Raffel, mT5: A massively multilingual pre-trained text-to-text transformer, in: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics, Online, 2021, pp. 483–498. URL: https://aclanthology.org/2021.naacl-main.41. doi:10.18653/v1/2021.naacl-main.41.</p>
      <p>[35] H. W. Wendt, Dealing with a common problem in social science: A simplified rank-biserial coefficient of correlation based on the U statistic, European Journal of Social Psychology (1972).</p>
      <p>[36] L. Frazier, Syntactic complexity, in: D. Dowty, L. Karttunen, A. Zwicky (Eds.), Natural Language Parsing, Cambridge University Press, Cambridge, UK, 1985.</p>
      <p>[37] E. Gibson, Linguistic complexity: Locality of syntactic dependencies, Cognition 24 (1998) 1–76.</p>
      <p>[38] V. Demberg, F. Keller, Data from eye-tracking corpora as evidence for theories of syntactic processing complexity, Cognition 109 (2008) 193–210.</p>
      <p>[39] J. Miller, R. Weinert, Spontaneous spoken language. Syntax and discourse, Oxford University Press, 1998.</p>
      <p>[40] G. Barlacchi, S. Tonelli, Ernesta: A sentence simplification tool for children’s stories in Italian, in: Computational Linguistics and Intelligent Text Processing: 14th International Conference, CICLing 2013, Springer Berlin Heidelberg, 2013, pp. 476–487.</p>
      <p>[41] M. Haspelmath, Against markedness (and what to replace it with), Journal of Linguistics 42 (2006) 25–70. doi:10.1017/S0022226705003683.</p>
      <p>[42] Z. Li, M. Shardlow, How do control tokens affect natural language generation tasks like text simplification, Natural Language Engineering 30 (2024) 915–942. doi:10.1017/S1351324923000566.</p>
    </sec>
<sec id="sec-5">
      <title>A. Prompt Template for Sentence Simplification</title>
      <p>Each model was prompted using its respective system
prompt provided in the Hugging Face documentation.
We also provided a task-specific prompt to instruct the
model to perform the Sentence Simplification task. The
following prompt pattern was used:
### Istruzione: Semplifica la seguente frase mantenendo il più possibile intatto il significato.
### Input: {original_sentence}
### Output:
English translation: “Instruction: Simplify the following sentence while keeping the meaning the same as much as possible.”</p>
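<p>As a minimal illustration, the template can be instantiated programmatically for each input sentence; the helper name build_prompt below is our own illustrative choice and not part of any released code, and the task-specific prompt would be appended after each model’s own system prompt as described above.</p>

```python
# Minimal sketch: filling the simplification prompt for one input sentence.
# The function name is illustrative, not the authors' code.
PROMPT_TEMPLATE = (
    "### Istruzione: Semplifica la seguente frase "
    "mantenendo il più possibile intatto il significato.\n"
    "### Input: {original_sentence}\n"
    "### Output:"
)

def build_prompt(original_sentence: str) -> str:
    # Substitute the sentence to be simplified into the template.
    return PROMPT_TEMPLATE.format(original_sentence=original_sentence)
```

<p>Generation then proceeds from the text following “### Output:”, so the model’s continuation is the simplified sentence.</p>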
    </sec>
<sec id="sec-7">
      <title>B. Linguistic Features and Readability Correlation Heatmap</title>
    </sec>
    <sec id="sec-10">
      <title>C. Linguistic Features of Original and Simplified Sentences</title>
      <p>During the preparation of this work, the author(s) used ChatGPT (OpenAI) in order to: Paraphrase
and reword, Improve writing style, and Grammar and spelling check. After using these
tool(s)/service(s), the author(s) reviewed and edited the content as needed and take(s) full
responsibility for the publication’s content.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
<mixed-citation>[1] F. Alva-Manchego, C. Scarton, L. Specia, Data-driven sentence simplification: Survey and benchmark, Computational Linguistics 46 (2020) 135–187. URL: https://aclanthology.org/2020.cl-1.4/. doi:10.1162/coli_a_00370.</mixed-citation>
      </ref>
      <ref id="ref2">
<mixed-citation>[2] M. J. Ryan, T. Naous, W. Xu, Revisiting non-English text simplification: A unified multilingual benchmark, in: A. Rogers, J. Boyd-Graber, N. Okazaki (Eds.), Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Toronto, Canada, 2023, pp. 4898–4927. URL: https://aclanthology.org/2023.acl-long.269/. doi:10.18653/v1/2023.acl-long.269.</mixed-citation>
      </ref>
      <ref id="ref3">
<mixed-citation>[3] D. Kauchak, Improving text simplification language modeling using unsimplified text data, in: H. Schuetze, P. Fung, M. Poesio (Eds.), Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Sofia, Bulgaria, 2013, pp. 1537–1546. URL: https://aclanthology.org/P13-1151/.</mixed-citation>
      </ref>
      <ref id="ref4">
<mixed-citation>[4] D. Pellow, M. Eskenazi, An open corpus of everyday documents for simplification tasks, in: S. Williams, A. Siddharthan, A. Nenkova (Eds.), Proceedings of the 3rd Workshop on Predicting and Improving Text Readability for Target Reader Populations (PITR), Association for Computational Linguistics, Gothenburg, Sweden, 2014, pp. 84–93. URL: https://aclanthology.org/W14-1210/. doi:10.3115/v1/W14-1210.</mixed-citation>
      </ref>
      <ref id="ref5">
<mixed-citation>[5] F. Alva-Manchego, L. Martin, A. Bordes, C. Scarton, B. Sagot, L. Specia, ASSET: A dataset for tuning and evaluation of sentence simplification models with multiple rewriting transformations, in: D. Jurafsky, J. Chai, N. Schluter, J. Tetreault (Eds.), Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Online, 2020, pp. 4668–4679. URL: https://aclanthology.org/2020.acl-main.424/. doi:10.18653/v1/2020.acl-main.424.</mixed-citation>
      </ref>
      <ref id="ref6">
<mixed-citation>[6] W. Xu, C. Callison-Burch, C. Napoles, Problems in current text simplification research: New data can help, Transactions of the Association for Computational Linguistics 3 (2015) 283–297. URL: https://aclanthology.org/Q15-1021/. doi:10.1162/tacl_a_00139.</mixed-citation>
      </ref>
      <ref id="ref7">
<mixed-citation>[7] D. Brunato, A. Cimino, F. Dell'Orletta, G. Venturi, PaCCSS-IT: A parallel corpus of complex-simple sentences for automatic text simplification, in: J. Su, K. Duh, X. Carreras (Eds.), Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Austin, Texas, 2016, pp. 351–361. URL: https://aclanthology.org/D16-1034/. doi:10.18653/v1/D16-1034.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>L.</given-names>
            <surname>Martin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Fan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>É.</given-names>
            <surname>de la Clergerie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bordes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Sagot</surname>
          </string-name>
          ,
          <article-title>MUSS: Multilingual unsupervised sentence simplification by mining paraphrases</article-title>
          , in:
          <string-name>
            <given-names>N.</given-names>
            <surname>Calzolari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Béchet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Blache</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Choukri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Cieri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Declerck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Goggi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Isahara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Maegaard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Mariani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Mazo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Odijk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Piperidis</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the Thirteenth Language Resources and Evaluation Conference</source>
          , European Language Resources Association, Marseille, France,
          <year>2022</year>
          , pp.
          <fpage>1651</fpage>
          -
          <lpage>1664</lpage>
          . URL: https://aclanthology.org/2022.lrec-1.176/.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>A.</given-names>
            <surname>Palmero Aprosio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Tonelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Turchi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Negri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Di Gangi</surname>
          </string-name>
          ,
          <article-title>Neural text simplification in low-resource conditions using weak supervision</article-title>
          , in:
          <string-name>
            <given-names>A.</given-names>
            <surname>Bosselut</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Celikyilmaz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ghazvininejad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Iyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>U.</given-names>
            <surname>Khandelwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Rashkin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Wolf</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the Workshop on Methods for Optimizing and Evaluating Neural Language Generation</source>
          , Association for Computational Linguistics, Minneapolis, Minnesota,
          <year>2019</year>
          , pp.
          <fpage>37</fpage>
          -
          <lpage>44</lpage>
          . URL: https://aclanthology.org/W19-2305/. doi: 10.18653/v1/W19-2305.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>M.</given-names>
            <surname>Miliani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Alva-Manchego</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lenci</surname>
          </string-name>
          ,
          <article-title>Simplifying administrative texts for Italian L2 readers with controllable transformers models: A data-driven approach</article-title>
          , in:
          <string-name>
            <given-names>F.</given-names>
            <surname>Boschetti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. E.</given-names>
            <surname>Lebani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Magnini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Novielli</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the 9th Italian Conference on Computational Linguistics (CLiC-it 2023)</source>
          , CEUR Workshop Proceedings, Venice, Italy,
          <year>2023</year>
          , pp.
          <fpage>303</fpage>
          -
          <lpage>315</lpage>
          . URL: https://aclanthology.org/2023.clicit-1.37/.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>L.</given-names>
            <surname>Long</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Xiao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Ding</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>On LLMs-driven synthetic data generation, curation, and evaluation: A survey</article-title>
          , in:
          <string-name>
            <given-names>L.-W.</given-names>
            <surname>Ku</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Martins</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Srikumar</surname>
          </string-name>
          (Eds.),
          <source>Findings of the Association for Computational Linguistics: ACL 2024</source>
          , Association for Computational Linguistics, Bangkok, Thailand,
          <year>2024</year>
          , pp.
          <fpage>11065</fpage>
          -
          <lpage>11082</lpage>
          . URL: https://aclanthology.org/2024.findings-acl.658/. doi: 10.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>