<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Segmenting Italian Sentences for Easy Reading</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Marta Cozzini</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Horacio Saggion</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Universitat Pompeu Fabra</institution>
          ,
          <addr-line>Carrer de la Mercè 12, Ciutat Vella, 08002 Barcelona</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Università di Bologna</institution>
          ,
          <addr-line>Via Zamboni 33, 40126 Bologna</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
<p>Easy Read texts are essential for individuals with reading difficulties. These texts are developed according to institutional guidelines that establish clear rules for writing and structuring content in an accessible way. A key feature of Easy Read texts is the segmentation of sentences into smaller grammatical units, often presented on separate lines, to enhance readability. While several studies have addressed content simplification in easy-to-read materials, much less attention has been paid to the automatic segmentation of such texts. This project investigates whether this kind of segmentation can be automated in a reliable and efficient way, even with limited resources. The main goal is to develop and evaluate automatic methods for splitting texts into simpler, shorter units to support text simplification and improve overall readability. The methods developed and evaluated are a decision tree classifier and a prompting-based method using a large language model (LLM). The work focuses on Italian, and the application of these methodologies to this language represents a novel contribution.</p>
      </abstract>
      <kwd-group>
        <kwd>Text simplification</kwd>
        <kwd>easy-to-read</kwd>
        <kwd>automatic segmentation</kwd>
        <kwd>ER resources</kwd>
        <kwd>CLiC-it</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Easy-to-read materials are important to ensure that as many people as possible can access information, especially people with cognitive disabilities, who might find it harder to understand complex texts or learn new things. These specific materials follow shared guidelines designed to make reading and understanding easier thanks to clear and consistent writing. Inclusion Europe created easy-to-read standards for preparing this kind of content in different languages [1]. Although these guidelines were originally designed for people with cognitive difficulties, they are also helpful for others, such as non-native speakers or anyone who finds reading challenging. Among the various recommendations, particular attention is paid to the use of simple vocabulary, short sentences, and a clear logical structure. Some guidelines also emphasize the importance of dividing the text into smaller grammatical units to improve readability. The Inclusion Europe guidelines state that each sentence should ideally fit on a single line and that longer sentences should be split at natural linguistic boundaries: where people would pause when reading out loud. This attention to segmentation is not only important for proper text layout, but, as the guidelines suggest and the following example demonstrates, it also plays a significant role in enhancing text comprehensibility.</p>
      <p>The Inclusion Europe guidelines advise against writing:</p>
      <p>Il modo in cui questa frase è
divisa non è facile da leggere.
("The way this sentence is split is not easy to read.")</p>
      <p>Instead, they recommend:</p>
      <p>Il modo in cui questa frase è divisa
è facile da leggere.
("The way this sentence is split is easy to read.")</p>
      <p>From a linguistic perspective, the first version interrupts a verbal phrase composed of the auxiliary “è” and the past participle “divisa.” This separation breaks the syntactic and semantic unity of the clause, making the sentence harder to process. By splitting these tightly connected elements across two lines, the reader’s comprehension effort increases. As the guidelines suggest, such breaks should be avoided in order to maintain clarity and facilitate understanding.</p>
      <p>Despite the growing interest in text simplification, the task of sentence segmentation in easy-to-read materials remains largely underexplored. Currently, there are very few resources that address easy-to-read principles in relation to automatic segmentation, and only a limited number of studies have investigated how segmentation can be implemented computationally within this framework. This work aims to fill this gap by exploring whether segmentation can be automated reliably and efficiently. In particular, we evaluate two approaches: a decision tree classifier and a prompting-based method using a large language model (LLM). Both models are tested on easy-to-read materials that we collected from sources that we consider particularly trustworthy in adhering to official ER guidelines. These very materials not only serve as the basis for our evaluation, but also represent a secondary contribution of this study, as they form two new corpora that can support future research not only on segmentation, but more broadly in the domain of Italian text simplification. Although they do not include original–simplified text pairs, they offer quality examples of simplified texts segmented according to established ER criteria.</p>
      <p>The paper is structured as follows: Section 2 reviews related work, followed by Section 3, which introduces the corpora used in our experiments, discussing their sources, the methodology behind their creation as easy-to-read materials, and other relevant details. Section 4 provides a detailed description of our methodology for the segmentation task, including both the decision tree and the prompting approaches. Section 5 presents our experimental setup, while Section 6 analyzes the results, evaluating each method, comparing their performance, and providing insights into the findings. Finally, Sections 7 and 8 conclude the paper by discussing key takeaways, addressing limitations, and describing future research directions.</p>
      <p>CLiC-it 2025: Eleventh Italian Conference on Computational Linguistics, September 24–26, 2025, Cagliari, Italy. * Corresponding author. † These authors contributed equally. marta.cozzini@studio.unibo.it (M. Cozzini); horacio.saggion@upf.edu (H. Saggion). ORCID: 0009-0004-1992-9132 (M. Cozzini); 0000-0003-0016-7807 (H. Saggion). © 2025 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).</p>
    </sec>
    <sec id="sec-1-1">
      <title>2. Related Work</title>
      <p>Text segmentation plays an important role in promoting textual accessibility and can be considered a relevant component of both Automatic Text Simplification (ATS) and the development of easy-to-read materials. ATS is a Natural Language Processing (NLP) task aimed at reducing the linguistic complexity of texts while preserving their original meaning [2]. It may involve modifications at the lexical, syntactic, or discourse level. In recent years, research on ATS has focused on developing approaches to simplify and adapt texts for individuals with cognitive disabilities or language impairments [3]. While ATS relies on computational strategies, easy-to-read materials are instead based on institutional guidelines that define clear rules for structuring content in an accessible way. These two approaches often converge on similar features that enhance readability: the use of simple vocabulary and grammar, short sentences, a clear logical structure, and the explanation of complex concepts in simpler terms. Within both frameworks, text segmentation is frequently emphasized: each sentence should ideally fit on a single line, and if this is not feasible, it should be split at natural linguistic boundaries to enhance clarity and facilitate comprehension.</p>
      <sec id="sec-1-1-1">
        <title>2.1. Sentence Segmentation</title>
        <p>Sentence segmentation is particularly valuable for creating accessible materials for individuals with reading challenges. Line breaks strategically inserted within long sentences can significantly improve readability [4]. The core concept behind sentence segmentation for easy reading materials is the division of complex sentences into smaller, more digestible chunks. This segmentation must follow "natural linguistic boundaries, ending at a position in the sentence where a reader would naturally pause" [5]. While intuitive to understand, defining precise criteria for these natural boundaries remains challenging. Recent research has explored the optimal approach to sentence splitting for improved comprehension. Studies have found that dividing sentences does enhance readability, with a particular finding that splitting a sentence into two parts improves readability more than splitting it into three [6][7]. This preference for two-sentence splits over three-sentence divisions has been confirmed through Bayesian modeling experiments using various linguistic and cognitive features [8]. For readers with learning difficulties, proper sentence segmentation is particularly valuable. Studies have found that sentence density is a significant negative predictor of inferential comprehension, meaning that "the higher the sentence density, the lower the ability of these students to find relationships between them" [9]. This finding underscores the importance of appropriate text segmentation for enhancing comprehension among diverse reader populations.</p>
      </sec>
      <sec id="sec-1-1-2">
        <title>2.2. Automatic Sentence Segmentation</title>
        <p>Despite increasing interest in text simplification, the specific task of automatic sentence segmentation in the context of easy-to-read (ER) materials remains largely underexplored. To our knowledge, only one study to date has directly investigated how segmentation can be computationally implemented within this framework [5], and currently, very few resources address ER principles in relation to automatic segmentation. However, segmentation plays a crucial role in related domains, most notably in subtitle generation, where readability is enhanced when subtitles are segmented at naturally occurring linguistic boundaries, in addition to meeting timing and space constraints. Research has shown that subtitle segmentation has a significant impact on readability [10], leading to the development of various computational approaches. For instance, Álvarez et al. [11] trained Support Vector Machine and Linear Regression models on professionally created subtitles to predict optimal subtitle breaks, later improving this method through the use of Conditional Random Fields [12]. These supervised approaches could, in principle, be adapted to ER settings, provided that sufficient annotated training data is available. Nonetheless, compared to subtitling, resources for ER segmentation are extremely limited. As mentioned before, to date, only one study has directly addressed the problem of sentence segmentation for the generation of ER texts. This work explores multiple approaches, including the use of generative large language models (LLMs) under different prompting modalities and a scoring-based method compatible with both constituency parsing and masked language modeling (MLM). In addition, it tackles the problem of data sparsity by developing new segmentation-centric datasets for Basque, English, and Spanish, thus laying the groundwork for further research in this domain [5].</p>
        <p>As the first study to focus specifically on automatic sentence segmentation within the context of ER materials, it has provided a valuable foundation for our work. Building on its insights, we aim to apply similar strategies to address the problem of sentence segmentation for Italian, a language for which text simplification resources and research remain scarcer compared to English or Spanish.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>3. Corpora</title>
      <p>To construct our corpora, we relied on two different websites: Due Parole [13] and Anffas [14], both known for their adherence to ER guidelines. From each source, we created a separate corpus, which was later used in our experiments. We describe these corpora in more detail in the following subsections.</p>
      <sec id="sec-2-1">
        <title>3.1. Corpus from Due Parole</title>
        <p>On the Due Parole website, we accessed the online archive of Due Parole, an Italian easy-to-read magazine that was published, with some interruptions, between 1989 and 2006. The magazine was specifically designed to provide accessible information to a broad audience, with simplified texts created by a team of linguists, journalists, and teachers from the University of Rome ’La Sapienza’. The corpus collected from this source consists exclusively of magazine articles, providing a consistent and well-structured textual base for training and initial testing of our models. From the online archive of Due Parole, we collected only the articles available in digital format, as web scraping was necessary to build the corpus. During the web scraping process, we preserved all original line breaks present in the formatted texts as published online. To ensure that the Due Parole corpus complied with the Inclusion Europe guidelines, we referred to Piemontese [15], which outlines the guidelines followed by the Due Parole team when producing easy-to-read texts. Some of the key recommendations concerned text segmentation: whenever the page layout allowed, each line was designed to contain a complete unit of meaning. If it was not possible to keep a sentence on a single line, sentence breaks were carefully managed to avoid arbitrary line breaks, with each line always ending on a whole word; words were never split across lines. This careful approach to segmentation shows an understanding of its effect on readability, emphasizing that sentence splitting should be deliberate and meaningful, unlike the more random breaks often seen in standard newspapers. The final corpus contains 311 articles, comprising 4855 sentences. Each article was saved as a separate plain text file with a .txt extension. All files were encoded in UTF-8, with special characters and HTML tags removed during preprocessing to ensure a clean and consistent textual format. The articles are organized in a hierarchical folder structure reflecting the original metadata: first by publication year, then by month, and finally by magazine section (e.g., "sport", "cultura"). This structure reflects the original editorial organization and allows for easy filtering by date or topic.</p>
      </sec>
      <sec id="sec-2-2">
        <title>3.2. Corpus from Anffas</title>
        <sec id="sec-2-2-1">
          <p>Our second source of easy-to-read materials is the website of Anffas, a national association of families of individuals with intellectual and/or relational disabilities. Anffas was one of the partners involved in the project that led to the definition of the European easy-to-read guidelines [16]. We can therefore expect the texts in the website’s section “Documenti facili da leggere” ("Easy-to-read documents") to follow these official guidelines. From all the easy-to-read materials published there, we selected only the texts included in the easy-to-read magazine ’A modo mio’. This choice was motivated by the need to align with the other corpus, which also consisted exclusively of magazine articles. The Anffas corpus was used exclusively as a test set. Unlike the Due Parole corpus, creating this corpus as plain text was more difficult because the texts were only available in PDF format, which ruled out the use of web scraping. We therefore had to convert them manually. However, because there were significantly fewer Anffas texts compared to Due Parole, this operation did not require too much time.</p>
          <p>Similarly to Due Parole, we preserved all original line
breaks present in the formatted texts as published
online. The final corpus contains 38 articles comprising
481 sentences. The articles are organized into folders
corresponding to each magazine issue, labeled by month
and year. Within each issue folder, there is one plain text
file (.txt) per magazine section (e.g., "sport", "spettacoli
e televisione").</p>
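          <p>For users of these corpora, the folder layout described above (one folder per issue, one plain-text file per magazine section) can be traversed with a few lines of Python. This is an illustrative sketch under the stated layout assumptions, not part of the released resources; the helper name iter_articles is ours.</p>
          <preformat>
```python
# Illustrative sketch: iterate over a corpus laid out as
# issue-folder/section.txt, yielding (issue, section, text) triples.
# The layout is assumed from the description in the text; iter_articles
# is a hypothetical helper, not part of the released corpora.
from pathlib import Path

def iter_articles(root):
    for txt in sorted(Path(root).glob("*/*.txt")):
        yield txt.parent.name, txt.stem, txt.read_text(encoding="utf-8")
```
          </preformat>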
          <p>Table 1 summarizes the statistics of our corpora, including the total number of sentences, the number of sentences that contain at least one segmentation point, and the number of sentences without any segmentation.</p>
          <table-wrap id="tab1">
            <label>Table 1</label>
            <caption><p>Corpora statistics</p></caption>
            <table>
              <thead><tr><th>Sentences</th><th>Due Parole</th><th>Anffas</th></tr></thead>
              <tbody>
                <tr><td>Total</td><td>4855</td><td>481</td></tr>
                <tr><td>With segmentation</td><td>4271 (88%)</td><td>42%</td></tr>
                <tr><td>Without segmentation</td><td>584 (12%)</td><td>58%</td></tr>
              </tbody>
            </table>
          </table-wrap>
          <p>From Table 1, we observe the differences between the two corpora in terms of segmented sentences. Specifically, in the Due Parole corpus, the number of sentences containing at least one segmentation point is 4,271, corresponding to 88% of the total sentences. In contrast, this percentage drops to 42% in the Anffas corpus. This discrepancy is expected to affect the performance of our decision tree model, which was trained on the Due Parole corpus and subsequently tested on the Anffas corpus, as we will see in Section 6.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>4. Methodology</title>
      <p>To explore the viability of automatic text segmentation in low-resource settings, we adopted two different approaches: a traditional machine learning method informed by linguistic features (a decision tree) [17] and a current prompting-based approach using a Large Language Model.</p>
      <sec id="sec-3-1">
        <title>4.1. Automatic Segmentation Using Decision Tree</title>
        <p>We first approached the task of automatic text segmentation as a binary classification problem. In this framework, the model is trained to assign a binary label, 0 or 1, to each token in the input text, where 1 indicates that a segmentation should occur immediately after that token, while 0 means no segmentation. To build the training data, we started from raw texts extracted from the Due Parole dataset, described in the previous section. We first segmented the texts into sentences using spaCy’s sentence tokenizer. Before sentence segmentation, we replaced all newline characters (\n) occurring within the text with a special marker &lt;seg&gt;, in order to preserve formatting information for subsequent processing (see step 2 of the example below). We then used the &lt;seg&gt; markers to split each sentence into smaller chunks, corresponding to the original internal line breaks (as shown in step 3). These splits helped us identify potential segmentation points within the sentence. For each token in the sentence, we assigned a binary label: 1 if it ended a chunk (except the final chunk in a sentence, labeled 0), and 0 otherwise. These labels serve as the target outputs that the model is trained to predict. Only after creating these target labels were the &lt;seg&gt; markers removed, and the cleaned sentences reconstructed and re-tokenized with spaCy to prepare the data for further processing (see step 4). The following example illustrates the preprocessing steps applied to our corpus before training the decision tree model:</p>
        <p>1. Original input. This example sentence, extracted from the raw text, will be used to illustrate the preprocessing steps. Note that at this stage of the pipeline the sentence is shown for demonstration purposes only, as the original text has not yet been segmented into sentences; the newline characters indicate editorial line breaks:
La Costituzione è l’insieme
delle leggi più importanti
della Repubblica italiana.</p>
        <p>2. Intermediate representation. The raw text is segmented into sentences, and newline characters are replaced with the special segmentation marker:
La Costituzione è l’insieme &lt;seg&gt;
delle leggi più importanti &lt;seg&gt;
della Repubblica italiana.</p>
        <p>3. Segmented output. The text is then split into segments at the positions marked by the &lt;seg&gt; tokens, which serve to identify potential segmentation boundaries:
['La Costituzione è l’insieme',
'delle leggi più importanti',
'della Repubblica italiana.']</p>
        <p>4. Linguistic analysis. Finally, the reconstructed sentence is used for token-level feature extraction in the classification model:
La Costituzione è l’insieme delle leggi più importanti della Repubblica italiana.</p>
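        <p>The labeling scheme illustrated in steps 1–4 can be sketched compactly in Python. The snippet below is a simplified illustration rather than the authors’ implementation: plain whitespace tokenization stands in for spaCy, and the editorial line breaks are used directly instead of the intermediate &lt;seg&gt; marker.</p>
        <preformat>
```python
# Simplified sketch of the token-labeling step: a token gets label 1 when
# an editorial line break follows it, except at the end of the sentence.
# Whitespace tokenization stands in for spaCy here.

def label_tokens(sentence_with_breaks):
    chunks = [c.split() for c in sentence_with_breaks.strip().split("\n")]
    tokens, labels = [], []
    for i, chunk in enumerate(chunks):
        for j, tok in enumerate(chunk):
            tokens.append(tok)
            ends_chunk = (j == len(chunk) - 1) and (i != len(chunks) - 1)
            labels.append(1 if ends_chunk else 0)
    return tokens, labels

text = "La Costituzione è l'insieme\ndelle leggi più importanti\ndella Repubblica italiana."
tokens, labels = label_tokens(text)
# labels -> [0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0]
```
        </preformat>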
        <p>After reconstructing the sentences, we performed feature extraction, including token-level features such as part-of-speech (POS) tags, sentence length (in tokens and characters), token length (in characters), and the token’s position within the sentence. We converted POS tags into binary features using one-hot encoding. Then, all the features and target labels were organized into a tabular structure, and a decision tree classifier was trained on these data to predict segmentation.</p>
      </sec>
      <sec id="sec-3-5">
        <title>4.2. Generative LLM Segmentation</title>
        <p>Our second approach to automatic text segmentation involved using an instruction-tuned large language model (LLM) with zero-shot prompting. The design of our prompts was based on both the prompt strategies proposed in Calleja et al. [5] and the recommendations outlined in the Inclusion Europe easy-to-read (ER) guidelines. Following the approach of Calleja et al. [5], we designed two separate prompts. The first prompt (Prompt 1) aligns with the formal Inclusion Europe guideline that states "tagliate la frase lì dove le persone farebbero una pausa leggendo la frase a voce alta" ("cut the sentence where people would pause when reading it aloud") [1], while the second (Prompt 2) relies on the identification of natural grammatical boundaries. Unlike Prompt 1, Prompt 2 avoids explicit mentions of reading pauses, which could be less accessible or meaningful to the model. To make the prompts more specific, we introduced an additional constraint on segment length, specifying that each segment should contain between 5 and 15 words. As is standard when prompting LLMs, we also added explicit instructions to ensure that the model would only output the requested content, without generating any additional text. In particular, we specified that the model should not include numbers, symbols, or bullet points at the beginning of lines, as our preliminary tests revealed a tendency to introduce such formatting elements.</p>
        <p>• Prompt 1: Dividi la seguente frase in segmenti separati, inserendo un ritorno a capo dove le persone farebbero una pausa leggendo la frase ad alta voce. Ogni segmento di testo dovrebbe contenere tra le 5 e le 15 parole. Il contenuto della frase originale non deve essere alterato in nessun modo; pertanto non deve essere aggiunta nuova informazione di alcun tipo. Scrivi ogni segmento su una nuova riga, senza numerazione o simboli all’inizio. Non generare altro testo ad eccezione del testo originale segmentato. ("Split the following sentence into separate segments, inserting a line break where people would pause when reading the sentence aloud. Each text segment should contain between 5 and 15 words. The content of the original sentence must not be altered in any way; therefore, no new information of any kind may be added. Write each segment on a new line, without numbering or symbols at the beginning. Do not generate any text other than the segmented original text.")</p>
        <p>• Prompt 2: Dividi la seguente frase in segmenti separati, che rispettino i confini grammaticali naturali. Ogni segmento di testo dovrebbe contenere tra le 5 e le 15 parole. Il contenuto della frase originale deve essere mantenuto rigorosamente; pertanto non deve essere aggiunta nuova informazione di alcun tipo. Scrivi ogni segmento su una nuova riga, senza numerazione o simboli all’inizio. Non generare altro testo ad eccezione del testo originale segmentato. ("Split the following sentence into separate segments that respect natural grammatical boundaries. Each text segment should contain between 5 and 15 words. The content of the original sentence must be kept strictly; therefore, no new information of any kind may be added. Write each segment on a new line, without numbering or symbols at the beginning. Do not generate any text other than the segmented original text.")</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>5. Experiments</title>
      <p>Our first approach to automatic sentence segmentation was based on a traditional machine learning model. In particular, we employed a decision tree classifier implemented via the DecisionTreeClassifier class in the sklearn.tree Python library [18]. To ensure replicability of our results, we set the random_state parameter. Additionally, we configured the classifier with the parameter class_weight=’balanced’, which automatically adjusts weights inversely proportional to the class frequencies in the input data. This choice was motivated by the significant imbalance in our dataset, where the target label 1 (indicating a segmentation point) is much less frequent than label 0 (no segmentation). To reduce the negative impact of this imbalance on model performance, we adopted this built-in balancing strategy provided by scikit-learn.</p>
      <p>For the prompting experiments, we used Gemma 2 9B, part of Google’s Gemma family of lightweight, state-of-the-art decoder-only large language models. A key advantage of this family is the relatively small model size and the availability of open weights, which make the models suitable for deployment in resource-limited environments such as laptops or personal cloud infrastructure. We loaded the model and tokenizer via the Hugging Face Transformers library, employing automatic device mapping and bfloat16 precision for efficient inference. Text generation was performed with controlled sampling parameters: a maximum of 150 new tokens, temperature set to 0.7, and nucleus sampling top_p at 0.9.</p>
      <p>The decision tree classifier was initially trained and tested on a portion of the Due Parole corpus (see Table 2), allowing an initial evaluation of its performance. Subsequently, to assess the model’s behavior on different types of texts, the decision tree was also tested on the Anffas corpus. At the same time, the LLM-based segmentation approach was applied exclusively to sentences from the Anffas corpus, in order to ensure that the results produced by the decision tree and the LLM would be directly comparable. As will be explained in more detail below, applying the same evaluation procedure to the Due Parole test set would have required excluding a substantial portion of the data, potentially biasing the results.</p>
      <p>Table 2 shows the distribution of the Due Parole corpus across the training, validation, and test sets.</p>
      <table-wrap id="tab2">
        <label>Table 2</label>
        <caption><p>Data partition statistics (number of tokens)</p></caption>
        <table>
          <thead><tr><th>Partition</th><th>Due Parole</th></tr></thead>
          <tbody>
            <tr><td>Train</td><td>64252</td></tr>
            <tr><td>Validation</td><td>7140</td></tr>
            <tr><td>Test</td><td>7933</td></tr>
          </tbody>
        </table>
      </table-wrap>
    </sec>
    <sec id="sec-5">
      <title>6. Results</title>
      <p>To evaluate the performance of our approaches, we relied on standard metrics commonly used in binary classification tasks, such as precision, recall, and F1-score. These metrics provide a comprehensive overview of model effectiveness, particularly in scenarios with imbalanced classes.</p>
      <sec id="sec-5-1">
        <title>6.1. Decision Tree Evaluation</title>
        <p>The decision tree model was assessed using the classification_report function from the sklearn.metrics module [18], which computes precision, recall, and F1-score. The initial evaluation was performed on a held-out portion of the Due Parole corpus used as the test set. Table 3 summarizes the results obtained from this first test. Subsequently, to assess the model’s behavior on different types of texts, the decision tree was also tested on the Anffas corpus. As shown in Table 4, the results differ substantially: the model performs notably worse. This performance drop can be attributed to the mismatch between the training data and the new test data. Although both corpora adhere to the Inclusion Europe guidelines and both consist of magazine articles, the texts in the Due Parole corpus exhibit a more uniform structure, largely influenced by the magazine’s fixed layout. In contrast, the Anffas ’A modo mio’ texts, while also published in magazine format, feature a more variable graphic layout, which may have affected the model’s ability to generalize.</p>
        <p>Another contributing factor is the discrepancy in the proportion of segmented sentences between the two corpora described in Section 3.2: while Due Parole contains 88% of sentences with at least one segmentation point, this percentage drops to only 42% in Anffas. This results in fewer positive instances (i.e., target variable = 1) in the Anffas corpus, which further compounds the already critical issue of target variable imbalance. This imbalance, as discussed earlier, consistently influences model performance both on the Due Parole test set and on the Anffas corpus, as reflected in the results tables. It notably affects the model’s ability to correctly identify the minority class (label 1), which corresponds to segmentation points, resulting in lower precision, recall, and F1 scores. This trend is especially visible in the results obtained on the Anffas corpus, where the model, trained on the more uniform Due Parole texts, struggles even more to generalize. The confusion matrix for the texts tested in the Anffas corpus (Table 6) further confirms the difficulty of the model in performing the segmentation task. This matrix reveals a high number of false positives (487), where the model incorrectly inserts a segmentation point (label 1) when none is required (label 0), leading to unnecessary breaks in the text. Moreover, the model fails to identify 172 actual segmentation points (false negatives), highlighting its tendency to miss where a break should occur. With only 65 true positives out of 237 actual positive cases, the model demonstrates a limited ability to detect segmentation points. This issue is not limited to the Anffas corpus: although results are slightly better on the Due Parole test set (Table 5), the overall performance remains sub-optimal. The model tends to generalize poorly when deciding where to segment, struggling both to avoid over-segmentation and to reliably identify the appropriate break points.</p>
        <table-wrap id="tab3">
          <label>Table 3</label>
          <caption><p>Results of automatic segmentation using the decision tree with Due Parole as the test set</p></caption>
          <table>
            <thead><tr><th>Target label</th><th>Precision</th><th>Recall</th><th>F1-score</th></tr></thead>
            <tbody>
              <tr><td>No segmentation (0)</td><td>0.90</td><td>0.90</td><td>0.90</td></tr>
              <tr><td>Segmentation (1)</td><td>0.38</td><td>0.38</td><td>0.38</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <table-wrap id="tab4">
          <label>Table 4</label>
          <caption><p>Results of automatic segmentation using the decision tree with Anffas as the test set</p></caption>
          <table>
            <thead><tr><th>Target label</th><th>Precision</th><th>Recall</th><th>F1-score</th></tr></thead>
            <tbody>
              <tr><td>No segmentation (0)</td><td>0.96</td><td>0.91</td><td>0.93</td></tr>
              <tr><td>Segmentation (1)</td><td>0.12</td><td>0.27</td><td>0.17</td></tr>
            </tbody>
          </table>
        </table-wrap>
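        <p>The evaluation setup discussed above can be reproduced in outline with scikit-learn. The snippet below is a minimal sketch, not the authors’ code: the tiny feature matrix is invented for illustration, while DecisionTreeClassifier with class_weight=’balanced’, a fixed random_state, and classification_report mirror the configuration described in the text.</p>
        <preformat>
```python
# Minimal sketch of the described setup: a decision tree with balanced
# class weights, evaluated via classification_report. The feature rows
# here (length/position-style values) are invented for illustration.
from sklearn.metrics import classification_report
from sklearn.tree import DecisionTreeClassifier

X = [[3, 1, 0], [7, 5, 1], [4, 2, 0], [9, 8, 1], [2, 1, 0], [8, 6, 1]]
y = [0, 1, 0, 1, 0, 1]  # 1 = a line break should follow this token

clf = DecisionTreeClassifier(random_state=42, class_weight="balanced")
clf.fit(X, y)
report = classification_report(y, clf.predict(X), zero_division=0)
```
        </preformat>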
      </sec>
      <sec id="sec-3-4">
        <title>To further understand the model’s behavior, we examined</title>
        <p>the feature importance values extracted from the trained
decision trees.</p>
        <p>As reported in Table 7, the most influential
predictors in both corpora are not morphosyntactic
categories, but rather positional features. In the Anfas
corpus, distanza_da_prima_parola, frase_len_token, and
frase_len_char dominate the ranking (23.2%, 20.5%, and
17.2% respectively), together accounting for more than
60% of the model’s decisions. These features capture
sentence length (in tokens and characters) as well as
token position within the sentence. Similarly, in Due
Parole, the top positions are held by frase_len_char (25%),
frase_len_token (17.6%), and distanza_da_prima_parola
(17.1%), confirming the central role of sentence length
and token positioning. Among morphosyntactic
categories, PRON and CCONJ are consistently relevant in
both datasets (around 9–10%), while core lexical classes
such as VERB, NOUN, and ADJ play a comparatively
minor role (below 2% in both corpora). One unexpected
result concerns punctuation. Despite the intuitive
assumption that punctuation strongly signals natural break
points (e.g., commas, periods, dashes), the PUNCT feature
accounts for only 0.3% of the total feature importance in
both corpora. This is striking, considering that many
segmentation guidelines, including those from easy-to-read
standards, emphasize splitting long sentences "where a
reader would naturally pause" [1], and punctuation marks
are prototypical indicators of such pauses. One plausible
explanation for the low importance assigned to
punctuation is related to the length of the sentences in the
training data. Since many of the texts adhere to
easy-to-read principles, the sentences are often already short and
simple, which means that internal punctuation marks
(such as commas or colons) appear less frequently. As a
result, punctuation rarely aligns with actual
segmentation points in the dataset, reducing its statistical weight
in the model’s learning process. Moreover, punctuation
that does appear, such as final periods, is not annotated
as a segmentation point, as it naturally marks the end of
a sentence. Taken together, these factors contribute to
the surprisingly low feature importance of punctuation
observed in the analysis. An ablation study, which
systematically removes or isolates features to assess their
individual and combined effects, could improve the
overall understanding of feature contributions. Additionally,
the influence of punctuation could be investigated by
partitioning the dataset into sentences with and without
internal punctuation and comparing feature importance
between these groups. This would clarify whether
punctuation plays a different role depending on its presence
in the sentence. These investigations are left for future
work.</p>
        <sec id="sec-3-4-1">
          <title>6.2. LLM Evaluation</title>
          <p>Evaluating the performance of the decision tree model
was straightforward thanks to the availability of
standard metrics and the classification_report
function from the sklearn.metrics module. However,
assessing the performance of the Large Language Model
(LLM) proved to be more complex. This is because,
whereas the decision tree outputs a binary label (0 or
1) for each token, the LLM produces fully segmented
sentences as output. To enable a direct comparison with
the decision tree, we first converted each segmented
sentence into a binary sequence. In this sequence, tokens
immediately preceding a line break were assigned a label
of 1, except for line breaks corresponding to the final
period of a sentence or cases where an entire sentence
appeared on a single line, which were labeled 0 since
they do not represent meaningful segmentation points in
our task. To ensure a fair comparison with the decision
tree, we aligned the length of the sequences produced
by the LLM with those of the reference data, since the
evaluation metrics used, such as precision, recall, and
F1-score, are sensitive to sequence length and require a
one-to-one correspondence between tokens. For this
reason, before converting the segmented sentences into
binary sequences, we manually reviewed the LLM outputs
to identify and remove noisy cases.</p>
          <p>After this filtering step, we converted the cleaned LLM
outputs into binary sequences and computed the same
evaluation metrics used for the decision tree, allowing
for a consistent and comparable analysis.</p>
          <p>Table 8 shows the number of sentences per prompt that
had to be removed from the Anfas test set due to changes
made by the LLM in generating the output. In the case
of the first prompt, the model introduced new content
or altered the original sentence in 58 out of 481 cases,
indicating relatively good adherence to the instructions.</p>
          <p>In contrast, the second prompt led to 139 modified
outputs. This total includes the 58 cases affected by the first
prompt, most of which were also altered in the second
output. The higher number of 139 modified sentences for
the second prompt reflects both these overlapping cases
and additional sentences uniquely altered in the second
output. This increase is likely due to the vagueness of the
expression "grammatical boundaries", which the model
tended to interpret more strongly, often replacing simple
line breaks with stronger punctuation marks, possibly
due to the presence of the term "boundaries". As a result,
we were able to evaluate the LLM’s performance on only
342 sentences from the original 481 in the Anfas dataset.</p>
          <p>To ensure comparability, we applied the same filtering
to the decision tree evaluation, testing it exclusively on
this same subset of sentences. On the Due Parole test
set, even more sentences had to be excluded from the
evaluation, as shown in Table 9: 123 from the first prompt
and 218 from the second. Despite having performed these
exclusions, we decided not to proceed with the evaluation
on the Due Parole test set: following the methodology
described above, we would have been left with only 260
evaluable sentences, corresponding to just 54% of the
dataset. Such a reduction could bias the evaluation, as
it might disproportionately exclude not only correctly
segmented instances but also those where the model fails
to segment properly. Future work will investigate
alternative evaluation strategies more appropriate for this
setting, including metrics such as BLEU and edit distance.</p>
        </sec>
        <sec id="sec-3-4-2">
          <title>6.3. Comparison between the Approaches</title>
          <p>To provide a comprehensive evaluation, we compared
the performance of the LLM-based approach, tested
exclusively on the Anfas dataset, with the decision tree
results, as summarized in Table 10. The LLM results
reveal, once more, a marked imbalance between the
two target labels (0 and 1). It is important to note
that, when converting the LLM outputs into binary
sequences, all sentences that appeared entirely on a
single line in the corpus were automatically assigned only
0s. In cases where the corresponding gold standard
sentence was also on a single line and contained no
segmentation points, we modified the default
behavior of the precision_recall_fscore_support
function to better reflect this scenario. By default, the function
may return undefined or misleading values when both
y_true and y_pred contain only 0s. To avoid this, we
configured the function so that it would treat such
predictions as fully correct and automatically assign precision,
recall, and F1-score values of 1.0. As reported in Table
10, on the reduced Anfas dataset the LLM outperformed
the decision tree overall. However, this result should be
interpreted with caution, especially considering that, as
shown in Section 6.2, it required excluding approximately
one-quarter of the original corpus. The exclusion was
necessary due to the model’s tendency to introduce extra
punctuation or to generate text exceeding the original
input. This behavior resulted in the loss of valuable data,
which is particularly critical in contexts where data are
already scarce, such as in easy-to-read materials.</p>
        </sec>
      </sec>
      <sec id="sec-3-5">
        <title>6.4. Comments on the Results</title>
        <p>These results should be interpreted with caution, as
segmentation is a non-standard and inherently subjective
task within the context of text simplification and
easy-to-read materials, precisely because multiple segmentations
can be valid for any given sentence, each potentially
facilitating comprehension in different ways. However,
conventional evaluation metrics such as precision and recall
enforce a strict binary framework, classifying predicted
segmentations as either entirely correct or completely
incorrect. This approach fails to consider cases where
a segmentation, although different from the reference,
is still reasonable or partially appropriate in terms of
improving readability. As a result, predictions that are
close to the gold standard or practically acceptable are
often penalized as errors, which can underestimate the
model’s true performance and limit its applicability in
real-world contexts.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>7. Conclusion</title>
      <p>The results obtained indicate that LLMs outperform a
simple decision tree in the task of automatic sentence
segmentation. However, as previously noted, these
improved results come at a cost; to properly evaluate the
LLM, we had to substantially reduce our test set, resulting
in the loss of valuable data in a domain where data
availability is already limited. Additionally, LLMs demand
significantly more computational resources and runtime,
requiring GPU acceleration to produce their outputs. Given
these important considerations, it is worth discussing
whether traditional machine learning approaches may
still be appropriate for tasks of this nature. While our
results do not provide conclusive evidence in this regard,
it remains possible that more sophisticated traditional
models, beyond simple decision trees, could achieve
competitive performance in automatic segmentation. Future
research could explore alternative models better suited
to handling imbalanced features and class distributions,
an issue evident in our datasets. Another contribution
of this work lies in the creation and compilation of the
Anfas and the Due Parole datasets. Although these
corpora do not include the original source texts typically
present in other resources for Italian text simplification,
they nonetheless represent valuable assets. Beyond their
utility for segmentation research, they provide a source
for broader investigations within the field of text
simplification. Currently, these datasets are pending
authorization for public release. Once approved, they will be
made openly accessible to the research community,
supporting future research on various aspects of Italian
text simplification.</p>
      <sec id="sec-4-2">
        <title>8. Limitations and Further Work</title>
        <p>The Inclusion Europe guidelines provide only vague
instructions on segmentation, and there are cases in which
our benchmarks even contradict these guidelines.
Moreover, segmentation remains a subjective task: while text
layout influences decisions, multiple strategies can be
equally valid for improving comprehension. Another
limitation is that the psycholinguistic impact of
segmentation and its role in enhancing understanding have only
been explored to a limited extent. Due to time constraints,
our study did not differentiate between grammatical and
ungrammatical segmentations, such as splitting an
article from its noun, but this represents an interesting area
for future research. For our evaluation, we used
precision, recall, and F1-score, mainly to ensure comparability
with the decision tree results. However, these metrics
present two main limitations: first, they impose a rigid
binary judgment that fails to account for the inherent
subjectivity of segmentation; second, they require a strict
one-to-one token correspondence, which led to the loss
of valuable data whenever the model added informative
tokens to the output. As mentioned in Section 6.2, future
work should explore alternative evaluation strategies,
such as BLEU or edit distance metrics, although the use
of edit distance would require a careful discussion to
define what constitutes a meaningful edit. In addition,
human evaluation should be considered to gain deeper
insights beyond what quantitative metrics alone can
offer.</p>
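      <p>As a concrete illustration of the edit-distance alternative mentioned among the future directions, two segmentations can be compared as sequences of lines, so that the distance counts how many segments must be inserted, deleted, or rewritten to turn the prediction into the reference. The example segmentations are invented:</p>

```python
# Sketch of an alignment-based evaluation: Levenshtein distance
# between predicted and reference segmentations, treating each
# line (segment) as one symbol. Example segmentations are invented.

def edit_distance(a, b):
    """Classic dynamic-programming Levenshtein distance over sequences."""
    dp = list(range(len(b) + 1))          # distances for the empty prefix of a
    for i, x in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, y in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,        # delete x
                                     dp[j - 1] + 1,    # insert y
                                     prev + (x != y))  # substitute
    return dp[-1]

ref = ["Il gatto dorme", "sul divano."]   # reference segments
hyp = ["Il gatto", "dorme sul divano."]   # predicted segments
print(edit_distance(ref, hyp))
```

      <p>A distance of 0 would indicate identical segmentations; deciding which edits count as "meaningful" (for instance, whether a boundary moved by one token is one edit or two) remains the open question raised above.</p>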
      <sec id="sec-4-1">
        <title>Acknowledgments</title>
        <p>This document is part of a project that has received
funding from the European Union’s Horizon Europe research
and innovation program under Grant Agreement No.
101132431 (iDEM Project). The views and opinions
expressed in this document are solely those of the author(s)
and do not necessarily reflect the views of the European
Union. Neither the European Union nor the granting
authority can be held responsible for them. We also
acknowledge support from the Spanish State Research
Agency under the Maria de Maeztu Units of Excellence
Program (CEX2021-001195-M). We are grateful to the
reviewers for their valuable comments, which have
significantly contributed to improving this work. This work
was conducted during a mobility funded by the Erasmus+
Traineeship Programme of the European Union, whose
support is gratefully acknowledged.
</p>
        <p>[1] Inclusion Europe, Information For All: European
Standards for making information easy to read and
understand (Easy-to-read ed.), 2009.</p>
        <p>[2] S. Bott, H. Saggion, Text simplification resources for
Spanish, Lang. Resour. Evaluation 48 (2014) 93–120.
URL: https://doi.org/10.1007/s10579-014-9265-4.
doi:10.1007/S10579-014-9265-4.</p>
        <p>[3] H. Saggion, J. O’Flaherty, T. Blanchet, S. Sharof,
S. Sanfilippo, L. Muñoz, M. Gollegger, A. Rascón,
J. L. Martí, S. Szasz, S. Bott, V. Sayman, Making
democratic deliberation and participation more
accessible: The idem project, in: A. Bonet-Jover,
R. Sepúlveda-Torres, R. M. Guillena,
E. Martínez-Cámara, E. L. Pastor, Rodrigo-Yuste,
A. Atutxa (Eds.), SEPLN (Projects and Demonstrations),
volume 3729 of CEUR Workshop Proceedings,
CEUR-WS.org, 2024, pp. 71–76. URL:
http://dblp.uni-trier.de/db/conf/sepln/sepln2024pd.html#SaggionOBSSMGRM24.</p>
        <p>[4] Y. Hayashibe, K. Mitsuzawa, Sentence boundary
detection on line breaks in Japanese, in: WNUT, 2020.
URL: https://api.semanticscholar.org/CorpusID:226283860.</p>
        <p>[5] J. Calleja, T. Etchegoyhen, D. Ponce, Automating
Easy Read Text Segmentation, in: Y. Al-Onaizan,
M. Bansal, Y.-N. Chen (Eds.), Findings of the
Association for Computational Linguistics: EMNLP 2024,
Association for Computational Linguistics, Miami,
Florida, USA, 2024, pp. 11876–11894. URL:
https://aclanthology.org/2024.findings-emnlp.694/.
doi:10.18653/v1/2024.findings-emnlp.694.</p>
        <p>[6] T. Nomoto, Does splitting make sentence easier?,
Frontiers in Artificial Intelligence 6 (2023). URL:
https://api.semanticscholar.org/CorpusID:262193456.</p>
        <p>[7] T. Nomoto, The fewer splits are better:
Deconstructing readability in sentence splitting,
ArXiv abs/2302.00937 (2023). URL:
https://api.semanticscholar.org/CorpusID:256460905.</p>
        <p>[8] T. Passali, E. Chatzikyriakidis, S. Andreadis,
T. G. Stavropoulos, A. Matonaki, A. Fachantidis,
G. Tsoumakas, From lengthy to lucid: A systematic
literature review on nlp techniques for taming long
sentences, ArXiv abs/2312.05172 (2023). URL:
https://api.semanticscholar.org/CorpusID:266149795.</p>
        <p>[9] I. Fajardo, V. Ávila, A. Ferrer, G. Tavares, M. Gómez,
A. M. Hernández, Easy-to-read texts for students
with intellectual disability: linguistic factors
affecting comprehension., Journal of Applied Research
in Intellectual Disabilities: JARID 27 3 (2014) 212–25.
URL: https://api.semanticscholar.org/CorpusID:33895340.</p>
        <p>[10] E. Perego, F. D. Missier, M. Porta, M. M., The
cognitive effectiveness of subtitle processing, Media
Psychology 13 (2010) 243–272.
doi:10.1080/15213269.2010.502873.</p>
        <p>[11] A. Álvarez, H. Arzelus, T. Etchegoyhen, Towards
customized automatic segmentation of subtitles, in:
J. L. Navarro Mesa, A. Ortega, A. Teixeira,
E. Hernández Pérez, P. Quintana Morales,
A. Ravelo García, I. Guerra Moreno, D. T. Toledano (Eds.),
Advances in Speech and Language Technologies for
Iberian Languages, Springer International Publishing,
Cham, 2014, pp. 229–238.</p>
        <p>[12] A. Álvarez, C.-D. Martínez-Hinarejos, H. Arzelus,
M. Balenciaga, A. del Pozo, Improving the automatic
segmentation of subtitles through conditional
random field, Speech Communication 88 (2017) 83–95.
URL: https://www.sciencedirect.com/science/article/pii/S0167639316300127.
doi:https://doi.org/10.1016/j.specom.2017.01.010.</p>
        <p>[13] Due Parole, Due parole, s.d. URL: https://www.dueparole.it/.</p>
        <p>[14] Anfas, Documenti facili da leggere,
https://www.anfas.net/it/linguaggio-facile-da-leggere/documenti-facili-da-leggere/,
s.d.</p>
        <p>[15] M. E. Piemontese, Scrittura e leggibilità: «due
parole», in: M. A. Cortelazzo (Ed.), Scrivere nella
scuola dell’obbligo, Quaderni del Giscel, La Nuova
Italia, Firenze, 1991, pp. 151–167.</p>
        <p>[16] Inclusion Europe, Pathways2, s.d. URL:
https://www.inclusion-europe.eu/pathways-2/.</p>
        <p>[17] D. Steinberg, Cart: Classification and regression
trees, 2009. URL: https://api.semanticscholar.org/CorpusID:116184048,
technical report.</p>
        <p>[18] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel,
B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer,
R. Weiss, V. Dubourg, J. VanderPlas, A. Passos,
D. Cournapeau, M. Brucher, M. Perrot, É. Duchesnay,
Scikit-learn: Machine learning in Python, Journal of
Machine Learning Research 12 (2011) 2825–2830. URL:
https://scikit-learn.org/stable/modules/tree.html.</p>
        <p>Declaration on Generative AI: During the preparation
of this work, the author(s) used ChatGPT (OpenAI) in
order to: Paraphrase and reword. After using these
tool(s)/service(s), the author(s) reviewed and edited the
content as needed and take(s) full responsibility for the
publication’s content.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>