<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Segmenting Italian Sentences for Easy Reading</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Marta Cozzini</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Horacio Saggion</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Universitat Pompeu Fabra</institution>
          ,
          <addr-line>Carrer de la Mercè 12, Ciutat Vella, 08002 Barcelona</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Università di Bologna</institution>
          ,
          <addr-line>Via Zamboni 33, 40126 Bologna</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
<p>Easy Read texts are essential for individuals with reading difficulties. These texts are developed according to institutional guidelines that establish clear rules for writing and structuring content in an accessible way. A key feature of Easy Read texts is the segmentation of sentences into smaller grammatical units, often presented on separate lines, to enhance readability. While several studies have addressed content simplification in easy-to-read materials, much less attention has been paid to the automatic segmentation of such texts. This project investigates whether this kind of segmentation can be automated in a reliable and efficient way, even with limited resources. The main goal is to develop and evaluate automatic methods for splitting texts into simpler, shorter units to support text simplification and improve overall readability. The methods developed and evaluated are a decision tree classifier and a prompting-based method using a large language model (LLM). The work focuses on Italian, and the application of these methodologies to this language represents a novel contribution.</p>
      </abstract>
      <kwd-group>
        <kwd>Text simplification</kwd>
        <kwd>easy-to-read</kwd>
        <kwd>automatic segmentation</kwd>
        <kwd>ER resources</kwd>
        <kwd>CLiC-it</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Easy-to-read materials are important to ensure that as many people as possible can access information, especially people with cognitive disabilities, who might find it harder to understand complex texts or learn new things. These specific materials follow shared guidelines designed to make reading and understanding easier thanks to clear and consistent writing. Inclusion Europe created easy-to-read standards for preparing this kind of content in different languages [1]. Although these guidelines were originally designed for people with cognitive difficulties, they are also helpful for others, such as non-native speakers or anyone who finds reading challenging. Among the various recommendations, particular attention is paid to the use of simple vocabulary, short sentences, and a clear logical structure. Some guidelines also emphasize the importance of dividing the text into smaller grammatical units to improve readability. The Inclusion Europe guidelines state that each sentence should ideally fit on a single line and that longer sentences should be split at natural linguistic boundaries: where people would pause when reading out loud. This attention to segmentation is not only important for proper text layout, but, as the guidelines suggest and the following example demonstrates, it also plays a significant role in enhancing text comprehensibility.</p>
      <p>The Inclusion Europe guidelines advise against writing:</p>
      <p>Il modo in cui questa frase è
divisa non è facile da leggere.
("The way this sentence is split is not easy to read.")</p>
      <p>Instead, they recommend:</p>
      <p>Il modo in cui questa frase è divisa
è facile da leggere.
("The way this sentence is split is easy to read.")</p>
      <p>From a linguistic perspective, the first version interrupts a verbal phrase composed of the auxiliary “è” and the past participle “divisa.” This separation breaks the syntactic and semantic unity of the clause, making the sentence harder to process. By splitting these tightly connected elements across two lines, the reader’s comprehension effort increases. As the guidelines suggest, such breaks should be avoided in order to maintain clarity and facilitate understanding.</p>
      <p>Despite the growing interest in text simplification, the task of sentence segmentation in easy-to-read materials remains largely underexplored. Currently, there are very few resources that address easy-to-read principles in relation to automatic segmentation, and only a limited number of studies have investigated how segmentation can be implemented computationally within this framework. This work aims to fill this gap by exploring whether segmentation can be automated reliably and efficiently. In particular, we evaluate two approaches: a decision tree classifier and a prompting-based method using a large language model (LLM). Both models are tested on easy-to-read materials that we collected from sources that we consider particularly trustworthy in adhering to official ER guidelines. These very materials not only serve as the basis for our evaluation, but also represent a secondary contribution of this study, as they form two new corpora that can support future research not only on segmentation, but more broadly in the domain of Italian text simplification. Although they do not include original–simplified text pairs, they offer quality examples of simplified texts segmented according to established ER criteria.</p>
      <p>The paper is structured as follows: Section 2 reviews related work, followed by Section 3, which introduces the corpora used in our experiments, discussing their sources, the methodology behind their creation as easy-to-read materials, and other relevant details. Section 4 provides a detailed description of our methodology for the segmentation task, including both the decision tree and the prompting approaches. Section 5 presents our experimental setup, while Section 6 analyzes the results, evaluating each method, comparing their performance, and providing insights into the findings. Finally, Sections 7 and 8 conclude the paper by discussing key takeaways, addressing limitations, and describing future research directions.</p>
      <p>CLiC-it 2025: Eleventh Italian Conference on Computational Linguistics, September 24–26, 2025, Cagliari, Italy. * Corresponding author. † These authors contributed equally. marta.cozzini@studio.unibo.it (M. Cozzini); horacio.saggion@upf.edu (H. Saggion). ORCID: 0009-0004-1992-9132 (M. Cozzini); 0000-0003-0016-7807 (H. Saggion). © 2025 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).</p>
    </sec>
    <sec id="sec-1-1">
      <title>2. Related Work</title>
      <p>Text segmentation plays an important role in promoting textual accessibility and can be considered a relevant component of both Automatic Text Simplification (ATS) and the development of easy-to-read materials. ATS is a Natural Language Processing (NLP) task aimed at reducing the linguistic complexity of texts while preserving their original meaning [2]. It may involve modifications at the lexical, syntactic, or discourse level. In recent years, research on ATS has focused on developing approaches to simplify and adapt texts for individuals with cognitive disabilities or language impairments [3]. While ATS relies on computational strategies, easy-to-read materials are instead based on institutional guidelines that define clear rules for structuring content in an accessible way. These two approaches often converge on similar features that enhance readability: the use of simple vocabulary and grammar, short sentences, a clear logical structure, and the explanation of complex concepts in simpler terms. Within both frameworks, text segmentation is frequently emphasized: each sentence should ideally fit on a single line, and if this is not feasible, it should be split at natural linguistic boundaries to enhance clarity and facilitate comprehension.</p>
      <sec id="sec-1-1-1">
        <title>2.1. Sentence Segmentation</title>
        <p>Sentence segmentation is particularly valuable for creating accessible materials for individuals with reading challenges. Line breaks strategically inserted within long sentences can significantly improve readability [4]. The core concept behind sentence segmentation for easy reading materials is the division of complex sentences into smaller, more digestible chunks. This segmentation must follow "natural linguistic boundaries, ending at a position in the sentence where a reader would naturally pause" [5]. While intuitive to understand, defining precise criteria for these natural boundaries remains challenging. Recent research has explored the optimal approach to sentence splitting for improved comprehension. Studies have found that dividing sentences does enhance readability, with a particular finding that splitting a sentence into two parts improves readability more than splitting it into three [6][7]. This preference for two-sentence splits over three-sentence divisions has been confirmed through Bayesian modeling experiments using various linguistic and cognitive features [8]. For readers with learning difficulties, proper sentence segmentation is particularly valuable. Studies have found that sentence density is a significant negative predictor of inferential comprehension, meaning that "the higher the sentence density, the lower the ability of these students to find relationships between them" [9]. This finding underscores the importance of appropriate text segmentation for enhancing comprehension among diverse reader populations.</p>
      </sec>
      <sec id="sec-1-1-2">
        <title>2.2. Automatic Sentence Segmentation</title>
        <p>Despite increasing interest in text simplification, the specific task of automatic sentence segmentation in the context of easy-to-read (ER) materials remains largely underexplored. To our knowledge, only one study to date has directly investigated how segmentation can be computationally implemented within this framework [5], and currently, very few resources address ER principles in relation to automatic segmentation. However, segmentation plays a crucial role in related domains, most notably in subtitle generation, where readability is enhanced when subtitles are segmented at naturally occurring linguistic boundaries, in addition to meeting timing and space constraints. Research has shown that subtitle segmentation has a significant impact on readability [10], leading to the development of various computational approaches. For instance, Álvarez et al. [11] trained Support Vector Machine and Linear Regression models on professionally created subtitles to predict optimal subtitle breaks, later improving this method through the use of Conditional Random Fields [12]. These supervised approaches could, in principle, be adapted to ER settings, provided that sufficient annotated training data is available. Nonetheless, compared to subtitling, resources for ER segmentation are extremely limited. As mentioned before, to date, only one study has directly addressed the problem of sentence segmentation for the generation of ER texts. This work explores multiple approaches, including the use of generative large language models (LLMs) under different prompting modalities and a scoring-based method compatible with both constituency parsing and masked language modeling (MLM). In addition, it tackles the problem of data sparsity by developing new segmentation-centric datasets for Basque, English, and Spanish, thus laying the groundwork for further research in this domain [5].</p>
        <p>As the first study to focus specifically on automatic sentence segmentation within the context of ER materials, it has provided a valuable foundation for our work. Building on its insights, we aim to apply similar strategies to address the problem of sentence segmentation for Italian, a language for which text simplification resources and research remain scarcer compared to English or Spanish.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>3. Corpora</title>
      <p>To construct our corpora, we relied on two different websites: Due Parole [13] and Anffas [14], both known for their adherence to ER guidelines. From each source, we created a separate corpus, which was later used in our experiments. We describe these corpora in more detail in the following subsections.</p>
      <sec id="sec-2-1">
        <title>3.1. Corpus from Due Parole</title>
        <p>On the Due Parole website, we accessed the online archive of Due Parole, an Italian easy-to-read magazine that was published, with some interruptions, between 1989 and 2006. The magazine was specifically designed to provide accessible information to a broad audience, with simplified texts created by a team of linguists, journalists, and teachers from the University of Rome ’La Sapienza’. The corpus collected from this source consists exclusively of magazine articles, providing a consistent and well-structured textual base for training and initial testing of our models. From the online archive of Due Parole, we collected only the articles available in digital format, as web scraping was necessary to build the corpus. During the web scraping process, we preserved all original line breaks present in the formatted texts as published online. To ensure that the Due Parole corpus complied with the Inclusion Europe guidelines, we referred to Piemontese [15], which outlines the guidelines followed by the Due Parole team when producing easy-to-read texts. Some of the key recommendations concerned text segmentation: whenever the page layout allowed, each line was designed to contain a complete unit of meaning. If it was not possible to keep a sentence on a single line, sentence breaks were carefully managed to avoid arbitrary line breaks, with each line always ending on a whole word; words were never split across lines. This careful approach to segmentation shows an understanding of its effect on readability, emphasizing that sentence splitting should be deliberate and meaningful, unlike the more random breaks often seen in standard newspapers. The final corpus contains 311 articles, comprising 4855 sentences. Each article was saved as a separate plain text file with a .txt extension. All files were encoded in UTF-8, with special characters and HTML tags removed during preprocessing to ensure a clean and consistent textual format. The articles are organized in a hierarchical folder structure reflecting the original metadata: first by publication year, then by month, and finally by magazine section (e.g., "sport", "cultura"). This structure reflects the original editorial organization and allows for easy filtering by date or topic.</p>
      </sec>
      <sec id="sec-2-2">
        <title>3.2. Corpus from Anffas</title>
        <sec id="sec-2-2-1">
          <p>Our second source of easy-to-read materials is the website of Anffas, a national association of families of individuals with intellectual and/or relational disabilities. Anffas was one of the partners involved in the project that led to the definition of the European easy-to-read guidelines [16]. We can therefore expect the texts in the website’s section “Documenti facili da leggere” ("Easy-to-read documents") to follow these official guidelines. From all the easy-to-read materials published there, we selected only the texts included in the easy-to-read magazine ’A modo mio’. This choice was motivated by the need to align with the other corpus, which also consisted exclusively of magazine articles. The Anffas corpus was used exclusively as a test set. Unlike the Due Parole corpus, creating this corpus as plain text was more difficult because the texts were only available in PDF format, which ruled out the use of web scraping. We therefore had to convert them manually. However, because there were significantly fewer Anffas texts compared to Due Parole, this operation did not require too much time.</p>
          <p>Similarly to Due Parole, we preserved all original line
breaks present in the formatted texts as published
online. The final corpus contains 38 articles comprising
481 sentences. The articles are organized into folders
corresponding to each magazine issue, labeled by month
and year. Within each issue folder, there is one plain text
file (.txt) per magazine section (e.g., "sport", "spettacoli
e televisione").</p>
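          <p>For users of these corpora, the folder layout described above (one folder per issue, one plain-text file per magazine section) can be traversed with a few lines of Python. This is an illustrative sketch under the stated layout assumptions, not part of the released resources; the helper name iter_articles is ours.</p>
          <preformat>
```python
# Illustrative sketch: iterate over a corpus laid out as
# issue-folder/section.txt, yielding (issue, section, text) triples.
# The layout is assumed from the description in the text; iter_articles
# is a hypothetical helper, not part of the released corpora.
from pathlib import Path

def iter_articles(root):
    for txt in sorted(Path(root).glob("*/*.txt")):
        yield txt.parent.name, txt.stem, txt.read_text(encoding="utf-8")
```
          </preformat>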
          <p>Table 1 summarizes the statistics of our corpora, including the total number of sentences, the number of sentences that contain at least one segmentation point, and the number of sentences without any segmentation.</p>
          <table-wrap id="tab1">
            <label>Table 1</label>
            <caption><p>Corpora statistics</p></caption>
            <table>
              <thead><tr><th>Sentences</th><th>Due Parole</th><th>Anffas</th></tr></thead>
              <tbody>
                <tr><td>Total</td><td>4855</td><td>481</td></tr>
                <tr><td>With segmentation</td><td>4271 (88%)</td><td>42%</td></tr>
                <tr><td>Without segmentation</td><td>584 (12%)</td><td>58%</td></tr>
              </tbody>
            </table>
          </table-wrap>
          <p>From Table 1, we observe the differences between the two corpora in terms of segmented sentences. Specifically, in the Due Parole corpus, the number of sentences containing at least one segmentation point is 4,271, corresponding to 88% of the total sentences. In contrast, this percentage drops to 42% in the Anffas corpus. This discrepancy is expected to affect the performance of our decision tree model, which was trained on the Due Parole corpus and subsequently tested on the Anffas corpus, as we will see in Section 6.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>4. Methodology</title>
      <p>To explore the viability of automatic text segmentation in low-resource settings, we adopted two different approaches: a traditional machine learning method informed by linguistic features (a decision tree) [17] and a current prompting-based approach using a Large Language Model.</p>
      <sec id="sec-3-1">
        <title>4.1. Automatic Segmentation Using Decision Tree</title>
        <p>We first approached the task of automatic text segmentation as a binary classification problem. In this framework, the model is trained to assign a binary label, 0 or 1, to each token in the input text, where 1 indicates that a segmentation should occur immediately after that token, while 0 means no segmentation. To build the training data, we started from raw texts extracted from the Due Parole dataset, described in the previous section. We first segmented the texts into sentences using spaCy’s sentence tokenizer. Before sentence segmentation, we replaced all newline characters (\n) occurring within the text with a special marker &lt;seg&gt;, in order to preserve formatting information for subsequent processing (see step 2 of the example below). We then used the &lt;seg&gt; markers to split each sentence into smaller chunks, corresponding to the original internal line breaks (as shown in step 3). These splits helped us identify potential segmentation points within the sentence. For each token in the sentence, we assigned a binary label: 1 if it ended a chunk (except the final chunk in a sentence, labeled 0), and 0 otherwise. These labels serve as the target outputs that the model is trained to predict. Only after creating these target labels were the &lt;seg&gt; markers removed, and the cleaned sentences reconstructed and re-tokenized with spaCy to prepare the data for further processing (see step 4). The following example illustrates the preprocessing steps applied to our corpus before training the decision tree model:</p>
        <p>1. Original input. This example sentence, extracted from the raw text, will be used to illustrate the preprocessing steps. Note that at this stage of the pipeline the sentence is shown for demonstration purposes only, as the original text has not yet been segmented into sentences; the newline characters indicate editorial line breaks:
La Costituzione è l’insieme
delle leggi più importanti
della Repubblica italiana.</p>
        <p>2. Intermediate representation. The raw text is segmented into sentences, and newline characters are replaced with the special segmentation marker:
La Costituzione è l’insieme &lt;seg&gt;
delle leggi più importanti &lt;seg&gt;
della Repubblica italiana.</p>
        <p>3. Segmented output. The text is then split into segments at the positions marked by the &lt;seg&gt; tokens, which serve to identify potential segmentation boundaries:
['La Costituzione è l’insieme',
'delle leggi più importanti',
'della Repubblica italiana.']</p>
        <p>4. Linguistic analysis. Finally, the reconstructed sentence is used for token-level feature extraction in the classification model:
La Costituzione è l’insieme delle leggi più importanti della Repubblica italiana.</p>
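        <p>The labeling scheme illustrated in steps 1–4 can be sketched compactly in Python. The snippet below is a simplified illustration rather than the authors’ implementation: plain whitespace tokenization stands in for spaCy, and the editorial line breaks are used directly instead of the intermediate &lt;seg&gt; marker.</p>
        <preformat>
```python
# Simplified sketch of the token-labeling step: a token gets label 1 when
# an editorial line break follows it, except at the end of the sentence.
# Whitespace tokenization stands in for spaCy here.

def label_tokens(sentence_with_breaks):
    chunks = [c.split() for c in sentence_with_breaks.strip().split("\n")]
    tokens, labels = [], []
    for i, chunk in enumerate(chunks):
        for j, tok in enumerate(chunk):
            tokens.append(tok)
            ends_chunk = (j == len(chunk) - 1) and (i != len(chunks) - 1)
            labels.append(1 if ends_chunk else 0)
    return tokens, labels

text = "La Costituzione è l'insieme\ndelle leggi più importanti\ndella Repubblica italiana."
tokens, labels = label_tokens(text)
# labels -> [0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0]
```
        </preformat>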
        <p>After reconstructing the sentences, we performed feature extraction, including token-level features such as part-of-speech (POS) tags, sentence length (in tokens and characters), token length (in characters), and the token’s position within the sentence. We converted POS tags into binary features using one-hot encoding. Then, all the features and target labels were organized into a tabular structure, and a decision tree classifier was trained on these data to predict segmentation.</p>
      </sec>
      <sec id="sec-3-5">
        <title>4.2. Generative LLM Segmentation</title>
        <p>Our second approach to automatic text segmentation involved using an instruction-tuned large language model (LLM) with zero-shot prompting. The design of our prompts was based on both the prompt strategies proposed in Calleja et al. [5] and the recommendations outlined in the Inclusion Europe easy-to-read (ER) guidelines. Following the approach of Calleja et al. [5], we designed two separate prompts. The first prompt (Prompt 1) aligns with the formal Inclusion Europe guideline that states "tagliate la frase lì dove le persone farebbero una pausa leggendo la frase a voce alta" ("cut the sentence where people would pause when reading it aloud") [1], while the second (Prompt 2) relies on the identification of natural grammatical boundaries. Unlike Prompt 1, Prompt 2 avoids explicit mentions of reading pauses, which could be less accessible or meaningful to the model. To make the prompts more specific, we introduced an additional constraint on segment length, specifying that each segment should contain between 5 and 15 words. As is standard when prompting LLMs, we also added explicit instructions to ensure that the model would only output the requested content, without generating any additional text. In particular, we specified that the model should not include numbers, symbols, or bullet points at the beginning of lines, as our preliminary tests revealed a tendency to introduce such formatting elements.</p>
        <p>• Prompt 1: Dividi la seguente frase in segmenti separati, inserendo un ritorno a capo dove le persone farebbero una pausa leggendo la frase ad alta voce. Ogni segmento di testo dovrebbe contenere tra le 5 e le 15 parole. Il contenuto della frase originale non deve essere alterato in nessun modo; pertanto non deve essere aggiunta nuova informazione di alcun tipo. Scrivi ogni segmento su una nuova riga, senza numerazione o simboli all’inizio. Non generare altro testo ad eccezione del testo originale segmentato. ("Split the following sentence into separate segments, inserting a line break where people would pause when reading the sentence aloud. Each text segment should contain between 5 and 15 words. The content of the original sentence must not be altered in any way; therefore, no new information of any kind may be added. Write each segment on a new line, without numbering or symbols at the beginning. Do not generate any text other than the segmented original text.")</p>
        <p>• Prompt 2: Dividi la seguente frase in segmenti separati, che rispettino i confini grammaticali naturali. Ogni segmento di testo dovrebbe contenere tra le 5 e le 15 parole. Il contenuto della frase originale deve essere mantenuto rigorosamente; pertanto non deve essere aggiunta nuova informazione di alcun tipo. Scrivi ogni segmento su una nuova riga, senza numerazione o simboli all’inizio. Non generare altro testo ad eccezione del testo originale segmentato. ("Split the following sentence into separate segments that respect natural grammatical boundaries. Each text segment should contain between 5 and 15 words. The content of the original sentence must be kept strictly; therefore, no new information of any kind may be added. Write each segment on a new line, without numbering or symbols at the beginning. Do not generate any text other than the segmented original text.")</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>5. Experiments</title>
      <p>Our first approach to automatic sentence segmentation was based on a traditional machine learning model. In particular, we employed a decision tree classifier implemented via the DecisionTreeClassifier class in the sklearn.tree Python library [18]. To ensure replicability of our results, we set the random_state parameter. Additionally, we configured the classifier with the parameter class_weight=’balanced’, which automatically adjusts weights inversely proportional to the class frequencies in the input data. This choice was motivated by the significant imbalance in our dataset, where the target label 1 (indicating a segmentation point) is much less frequent than label 0 (no segmentation). To reduce the negative impact of this imbalance on model performance, we adopted this built-in balancing strategy provided by scikit-learn.</p>
      <p>For the prompting experiments, we used Gemma 2 9B, part of Google’s Gemma family of lightweight, state-of-the-art decoder-only large language models. A key advantage of this family is the relatively small model size and the availability of open weights, which make the models suitable for deployment in resource-limited environments such as laptops or personal cloud infrastructure. We loaded the model and tokenizer via the Hugging Face Transformers library, employing automatic device mapping and bfloat16 precision for efficient inference. Text generation was performed with controlled sampling parameters: a maximum of 150 new tokens, temperature set to 0.7, and nucleus sampling top_p at 0.9.</p>
      <p>The decision tree classifier was initially trained and tested on a portion of the Due Parole corpus (see Table 2), allowing an initial evaluation of its performance. Subsequently, to assess the model’s behavior on different types of texts, the decision tree was also tested on the Anffas corpus. At the same time, the LLM-based segmentation approach was applied exclusively to sentences from the Anffas corpus, in order to ensure that the results produced by the decision tree and the LLM would be directly comparable. As will be explained in more detail below, applying the same evaluation procedure to the Due Parole test set would have required excluding a substantial portion of the data, potentially biasing the results.</p>
      <p>Table 2 shows the distribution of the Due Parole corpus across the training, validation, and test sets.</p>
      <table-wrap id="tab2">
        <label>Table 2</label>
        <caption><p>Data partition statistics (number of tokens)</p></caption>
        <table>
          <thead><tr><th>Partition</th><th>Due Parole</th></tr></thead>
          <tbody>
            <tr><td>Train</td><td>64252</td></tr>
            <tr><td>Validation</td><td>7140</td></tr>
            <tr><td>Test</td><td>7933</td></tr>
          </tbody>
        </table>
      </table-wrap>
    </sec>
    <sec id="sec-5">
      <title>6. Results</title>
      <p>To evaluate the performance of our approaches, we relied on standard metrics commonly used in binary classification tasks, such as precision, recall, and F1-score. These metrics provide a comprehensive overview of model effectiveness, particularly in scenarios with imbalanced classes.</p>
      <sec id="sec-5-1">
        <title>6.1. Decision Tree Evaluation</title>
        <p>The decision tree model was assessed using the classification_report function from the sklearn.metrics module [18], which computes precision, recall, and F1-score. The initial evaluation was performed on a held-out portion of the Due Parole corpus used as the test set. Table 3 summarizes the results obtained from this first test. Subsequently, to assess the model’s behavior on different types of texts, the decision tree was also tested on the Anffas corpus. As shown in Table 4, the results differ substantially: the model performs notably worse. This performance drop can be attributed to the mismatch between the training data and the new test data. Although both corpora adhere to the Inclusion Europe guidelines and both consist of magazine articles, the texts in the Due Parole corpus exhibit a more uniform structure, largely influenced by the magazine’s fixed layout. In contrast, the Anffas ’A modo mio’ texts, while also published in magazine format, feature a more variable graphic layout, which may have affected the model’s ability to generalize.</p>
        <p>Another contributing factor is the discrepancy in the proportion of segmented sentences between the two corpora described in Section 3.2: while Due Parole contains 88% of sentences with at least one segmentation point, this percentage drops to only 42% in Anffas. This results in fewer positive instances (i.e., target variable = 1) in the Anffas corpus, which further compounds the already critical issue of target variable imbalance. This imbalance, as discussed earlier, consistently influences model performance both on the Due Parole test set and on the Anffas corpus, as reflected in the results tables. It notably affects the model’s ability to correctly identify the minority class (label 1), which corresponds to segmentation points, resulting in lower precision, recall, and F1 scores. This trend is especially visible in the results obtained on the Anffas corpus, where the model, trained on the more uniform Due Parole texts, struggles even more to generalize. The confusion matrix for the texts tested in the Anffas corpus (Table 6) further confirms the difficulty of the model in performing the segmentation task. This matrix reveals a high number of false positives (487), where the model incorrectly inserts a segmentation point (label 1) when none is required (label 0), leading to unnecessary breaks in the text. Moreover, the model fails to identify 172 actual segmentation points (false negatives), highlighting its tendency to miss where a break should occur. With only 65 true positives out of 237 actual positive cases, the model demonstrates a limited ability to detect segmentation points. This issue is not limited to the Anffas corpus: although results are slightly better on the Due Parole test set (Table 5), the overall performance remains sub-optimal. The model tends to generalize poorly when deciding where to segment, struggling both to avoid over-segmentation and to reliably identify the appropriate break points.</p>
        <table-wrap id="tab3">
          <label>Table 3</label>
          <caption><p>Results of automatic segmentation using the decision tree with Due Parole as the test set</p></caption>
          <table>
            <thead><tr><th>Target label</th><th>Precision</th><th>Recall</th><th>F1-score</th></tr></thead>
            <tbody>
              <tr><td>No segmentation (0)</td><td>0.90</td><td>0.90</td><td>0.90</td></tr>
              <tr><td>Segmentation (1)</td><td>0.38</td><td>0.38</td><td>0.38</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <table-wrap id="tab4">
          <label>Table 4</label>
          <caption><p>Results of automatic segmentation using the decision tree with Anffas as the test set</p></caption>
          <table>
            <thead><tr><th>Target label</th><th>Precision</th><th>Recall</th><th>F1-score</th></tr></thead>
            <tbody>
              <tr><td>No segmentation (0)</td><td>0.96</td><td>0.91</td><td>0.93</td></tr>
              <tr><td>Segmentation (1)</td><td>0.12</td><td>0.27</td><td>0.17</td></tr>
            </tbody>
          </table>
        </table-wrap>
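        <p>The evaluation setup discussed above can be reproduced in outline with scikit-learn. The snippet below is a minimal sketch, not the authors’ code: the tiny feature matrix is invented for illustration, while DecisionTreeClassifier with class_weight=’balanced’, a fixed random_state, and classification_report mirror the configuration described in the text.</p>
        <preformat>
```python
# Minimal sketch of the described setup: a decision tree with balanced
# class weights, evaluated via classification_report. The feature rows
# here (length/position-style values) are invented for illustration.
from sklearn.metrics import classification_report
from sklearn.tree import DecisionTreeClassifier

X = [[3, 1, 0], [7, 5, 1], [4, 2, 0], [9, 8, 1], [2, 1, 0], [8, 6, 1]]
y = [0, 1, 0, 1, 0, 1]  # 1 = a line break should follow this token

clf = DecisionTreeClassifier(random_state=42, class_weight="balanced")
clf.fit(X, y)
report = classification_report(y, clf.predict(X), zero_division=0)
```
        </preformat>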
      </sec>
      <sec id="sec-3-4">
        <title>To further understand the model’s behavior, we examined</title>
        <p>the feature importance values extracted from the trained
decision trees.</p>
        <p>As reported in Table 7, the most influential
predictors in both corpora are not morphosyntactic
categories, but rather positional features. In the Anfas
corpus, distanza_da_prima_parola, frase_len_token, and
frase_len_char dominate the ranking (23.2%, 20.5%, and
17.2% respectively), together accounting for more than
60% of the model’s decisions. These features capture
sentence length (in tokens and characters) as well as
token position within the sentence. Similarly, in Due
Parole, the top positions are held by frase_len_char (25%),
frase_len_token (17.6%), and distanza_da_prima_parola
(17.1%), confirming the central role of sentence length
and token positioning. Among morphosyntactic
categories, PRON and CCONJ are consistently relevant in
both datasets (around 9–10%), while core lexical classes
such as VERB, NOUN, and ADJ play a comparatively
minor role (below 2% in both corpora). One unexpected
result concerns punctuation. Despite the intuitive
assumption that punctuation strongly signals natural break
points (e.g., commas, periods, dashes), the PUNCT feature
accounts for only 0.3% of the total feature importance in
both corpora. This is striking, considering that many
segmentation guidelines, including those from easy-to-read
standards, emphasize splitting long sentences "where a
reader would naturally pause" [1], and punctuation marks
are prototypical indicators of such pauses. One plausible
explanation for the low importance assigned to
punctuation is related to the length of the sentences in the
training data. Since many of the texts adhere to
easy-to-read principles, the sentences are often already short and
simple, which means that internal punctuation marks
(such as commas or colons) appear less frequently. As a
result, punctuation rarely aligns with actual
segmentation points in the dataset, reducing its statistical weight
in the model’s learning process. Moreover, punctuation
that does appear, such as final periods, is not annotated
as a segmentation point, as it naturally marks the end of
a sentence. Taken together, these factors contribute to
the surprisingly low feature importance of punctuation
observed in the analysis. An ablation study, which
systematically removes or isolates features to assess their
individual and combined effects, could improve the
overall understanding of feature contributions. Additionally,
the influence of punctuation could be investigated by
partitioning the dataset into sentences with and without
internal punctuation and comparing feature importance
between these groups. This would clarify whether
punctuation plays a different role depending on its presence
in the sentence. These investigations are left for future
work.</p>
        <sec id="sec-3-4-1">
          <title>6.2. LLM Evaluation</title>
          <p>Evaluating the performance of the decision tree model
was straightforward thanks to the availability of
standard metrics and the classification_report
function from the sklearn.metrics module. However,
assessing the performance of the Large Language Model
(LLM) proved to be more complex. This is because,
whereas the decision tree outputs a binary label (0 or
1) for each token, the LLM produces fully segmented
sentences as output. To enable a direct comparison with
the decision tree, we first converted each segmented
sentence into a binary sequence. In this sequence, tokens
immediately preceding a line break were assigned a label
of 1, except for line breaks corresponding to the final
period of a sentence or cases where an entire sentence
appeared on a single line, which were labeled 0 since
they do not represent meaningful segmentation points in
our task. To ensure a fair comparison with the decision
tree, we aligned the length of the sequences produced
by the LLM with those of the reference data, since the
evaluation metrics used, such as precision, recall, and
F1-score, are sensitive to sequence length and require a
one-to-one correspondence between tokens. For this
reason, before converting the segmented sentences into
binary sequences, we manually reviewed the LLM outputs
to identify and remove noisy cases.</p>
          <p>After this filtering step, we converted the cleaned LLM
outputs into binary sequences and computed the same
evaluation metrics used for the decision tree, allowing
for a consistent and comparable analysis.</p>
          <p>Table 8 shows the number of sentences per prompt that
had to be removed from the Anfas test set due to changes
made by the LLM in generating the output. In the case
of the first prompt, the model introduced new content
or altered the original sentence in 58 out of 481 cases,
indicating relatively good adherence to the instructions.</p>
          <p>In contrast, the second prompt led to 139 modified
outputs. This total includes the 58 cases affected by the first
prompt, most of which were also altered in the second
output. The higher number of 139 modified sentences for
the second prompt reflects both these overlapping cases
and additional sentences uniquely altered in the second
output. This increase is likely due to the vagueness of the
expression "grammatical boundaries", which the model
tended to interpret more strongly, often replacing simple
line breaks with stronger punctuation marks, possibly
due to the presence of the term "boundaries". As a result,
we were able to evaluate the LLM’s performance on only
342 sentences from the original 481 in the Anfas dataset.</p>
          <p>To ensure comparability, we applied the same filtering
to the decision tree evaluation, testing it exclusively on
this same subset of sentences. On the Due Parole test
set, even more sentences had to be excluded from the
evaluation, as shown in Table 9: 123 from the first prompt
and 218 from the second. Despite having performed these
exclusions, we decided not to proceed with the evaluation
on the Due Parole test set: following the methodology
described above, we would have been left with only 260
evaluable sentences, corresponding to just 54% of the
dataset. Such a reduction could bias the evaluation, as
it might disproportionately exclude not only correctly
segmented instances but also those where the model fails
to segment properly. Future work will investigate
alternative evaluation strategies more appropriate for this
setting, including metrics such as BLEU and edit distance.</p>
        </sec>
        <sec id="sec-3-4-2">
          <title>6.3. Comparison between the Approaches</title>
          <p>To provide a comprehensive evaluation, we compared
the performance of the LLM-based approach, tested
exclusively on the Anfas dataset, with the decision tree
results, as summarized in Table 10. The LLM results
reveal, once more, a marked imbalance between the
two target labels (0 and 1). It is important to note
that, when converting the LLM outputs into binary
sequences, all sentences that appeared entirely on a
single line in the corpus were automatically assigned only
0s. In cases where the corresponding gold standard
sentence was also on a single line and contained no
segmentation points, we modified the default
behavior of the precision_recall_fscore_support
function to better reflect this scenario. By default, the function
may return undefined or misleading values when both
y_true and y_pred contain only 0s. To avoid this, we
configured the function so that it would treat such
predictions as fully correct and automatically assign precision,
recall, and F1-score values of 1.0. As reported in Table
10, on the reduced Anfas dataset the LLM outperformed
the decision tree overall. However, this result should be
interpreted with caution, especially considering that, as
shown in Section 6.2, it required excluding approximately
one-quarter of the original corpus. The exclusion was
necessary due to the model’s tendency to introduce extra
punctuation or to generate text exceeding the original
input. This behavior resulted in the loss of valuable data,
which is particularly critical in contexts where data are
already scarce, such as in easy-to-read materials.</p>
        </sec>
      </sec>
      <sec id="sec-3-5">
        <title>6.4. Comments on the Results</title>
        <p>These results should be interpreted with caution, as
segmentation is a non-standard and inherently subjective
task within the context of text simplification and
easy-to-read materials, precisely because multiple segmentations
can be valid for any given sentence, each potentially
facilitating comprehension in different ways. However,
conventional evaluation metrics such as precision and recall
enforce a strict binary framework, classifying predicted
segmentations as either entirely correct or completely
incorrect. This approach fails to consider cases where
a segmentation, although different from the reference,
is still reasonable or partially appropriate in terms of
improving readability. As a result, predictions that are
close to the gold standard or practically acceptable are
often penalized as errors, which can underestimate the
model’s true performance and limit its applicability in
real-world contexts.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>7. Conclusion</title>
      <p>The results obtained indicate that LLMs outperform a
simple decision tree in the task of automatic sentence
segmentation. However, as previously noted, these
improved results come at a cost; to properly evaluate the
LLM, we had to substantially reduce our test set, resulting
in the loss of valuable data in a domain where data
availability is already limited. Additionally, LLMs demand
significantly more computational resources and runtime,
requiring GPU acceleration to produce their outputs. Given
these important considerations, it is worth discussing
whether traditional machine learning approaches may
still be appropriate for tasks of this nature. While our
results do not provide conclusive evidence in this regard,
it remains possible that more sophisticated traditional
models, beyond simple decision trees, could achieve
competitive performance in automatic segmentation. Future
research could explore alternative models better suited
to handling imbalanced features and class distributions,
an issue evident in our datasets. Another contribution
of this work lies in the creation and compilation of the
Anfas and the Due Parole datasets. Although these
corpora do not include the original source texts typically
present in other resources for Italian text simplification,
they nonetheless represent valuable assets. Beyond their
utility for segmentation research, they provide a source
for broader investigations within the field of text
simplification. Currently, these datasets are pending
authorization for public release. Once approved, they will be
made openly accessible to the research community,
supporting future research on various aspects of Italian
text simplification.</p>
      <sec id="sec-4-2">
        <title>8. Limitations and Further Work</title>
        <p>The Inclusion Europe guidelines provide only vague
instructions on segmentation, and there are cases in which
our benchmarks even contradict these guidelines.
Moreover, segmentation remains a subjective task: while text
layout influences decisions, multiple strategies can be
equally valid for improving comprehension. Another
limitation is that the psycholinguistic impact of
segmentation and its role in enhancing understanding have only
been explored to a limited extent. Due to time constraints,
our study did not differentiate between grammatical and
ungrammatical segmentations, such as splitting an
article from its noun, but this represents an interesting area
for future research. For our evaluation, we used
precision, recall, and F1-score, mainly to ensure comparability
with the decision tree results. However, these metrics
present two main limitations: first, they impose a rigid
binary judgment that fails to account for the inherent
subjectivity of segmentation; second, they require a strict
one-to-one token correspondence, which led to the loss
of valuable data whenever the model added informative
tokens to the output. As mentioned in Section 6.2, future
work should explore alternative evaluation strategies,
such as BLEU or edit distance metrics, although the use
of edit distance would require a careful discussion to
define what constitutes a meaningful edit. In addition,
human evaluation should be considered to gain deeper
insights beyond what quantitative metrics alone can
offer.</p>
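      <p>As a concrete illustration of the edit-distance alternative mentioned among the future directions, two segmentations can be compared as sequences of lines, so that the distance counts how many segments must be inserted, deleted, or rewritten to turn the prediction into the reference. The example segmentations are invented:</p>

```python
# Sketch of an alignment-based evaluation: Levenshtein distance
# between predicted and reference segmentations, treating each
# line (segment) as one symbol. Example segmentations are invented.

def edit_distance(a, b):
    """Classic dynamic-programming Levenshtein distance over sequences."""
    dp = list(range(len(b) + 1))          # distances for the empty prefix of a
    for i, x in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, y in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,        # delete x
                                     dp[j - 1] + 1,    # insert y
                                     prev + (x != y))  # substitute
    return dp[-1]

ref = ["Il gatto dorme", "sul divano."]   # reference segments
hyp = ["Il gatto", "dorme sul divano."]   # predicted segments
print(edit_distance(ref, hyp))
```

      <p>A distance of 0 would indicate identical segmentations; deciding which edits count as "meaningful" (for instance, whether a boundary moved by one token is one edit or two) remains the open question raised above.</p>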
      <sec id="sec-4-1">
        <title>Acknowledgments</title>
        <p>This document is part of a project that has received
funding from the European Union’s Horizon Europe research
and innovation program under Grant Agreement No.
101132431 (iDEM Project). The views and opinions
expressed in this document are solely those of the author(s)
and do not necessarily reflect the views of the European
Union. Neither the European Union nor the granting
authority can be held responsible for them. We also
acknowledge support from the Spanish State Research
Agency under the Maria de Maeztu Units of Excellence
Program (CEX2021-001195-M). We are grateful to the
reviewers for their valuable comments, which have
significantly contributed to improving this work. This work
was conducted during a mobility funded by the Erasmus+
Traineeship Programme of the European Union, whose
support is gratefully acknowledged.
</p>
        <p>[1] Inclusion Europe, Information For All: European
Standards for making information easy to read and
understand (Easy-to-read ed.), 2009.</p>
        <p>[2] S. Bott, H. Saggion, Text simplification resources for
Spanish, Lang. Resour. Evaluation 48 (2014) 93–120.
URL: https://doi.org/10.1007/s10579-014-9265-4.
doi:10.1007/S10579-014-9265-4.</p>
        <p>[3] H. Saggion, J. O’Flaherty, T. Blanchet, S. Sharof,
S. Sanfilippo, L. Muñoz, M. Gollegger, A. Rascón,
J. L. Martí, S. Szasz, S. Bott, V. Sayman, Making
democratic deliberation and participation more
accessible: The idem project, in: A. Bonet-Jover,
R. Sepúlveda-Torres, R. M. Guillena,
E. Martínez-Cámara, E. L. Pastor, Rodrigo-Yuste,
A. Atutxa (Eds.), SEPLN (Projects and Demonstrations),
volume 3729 of CEUR Workshop Proceedings,
CEUR-WS.org, 2024, pp. 71–76. URL:
http://dblp.uni-trier.de/db/conf/sepln/sepln2024pd.html#SaggionOBSSMGRM24.</p>
        <p>[4] Y. Hayashibe, K. Mitsuzawa, Sentence boundary
detection on line breaks in Japanese, in: WNUT, 2020.
URL: https://api.semanticscholar.org/CorpusID:226283860.</p>
        <p>[5] J. Calleja, T. Etchegoyhen, D. Ponce, Automating
Easy Read Text Segmentation, in: Y. Al-Onaizan,
M. Bansal, Y.-N. Chen (Eds.), Findings of the
Association for Computational Linguistics: EMNLP 2024,
Association for Computational Linguistics, Miami,
Florida, USA, 2024, pp. 11876–11894. URL:
https://aclanthology.org/2024.findings-emnlp.694/.
doi:10.18653/v1/2024.findings-emnlp.694.</p>
        <p>[6] T. Nomoto, Does splitting make sentence easier?,
Frontiers in Artificial Intelligence 6 (2023). URL:
https://api.semanticscholar.org/CorpusID:262193456.</p>
        <p>[7] T. Nomoto, The fewer splits are better:
Deconstructing readability in sentence splitting,
ArXiv abs/2302.00937 (2023). URL:
https://api.semanticscholar.org/CorpusID:256460905.</p>
        <p>[8] T. Passali, E. Chatzikyriakidis, S. Andreadis,
T. G. Stavropoulos, A. Matonaki, A. Fachantidis,
G. Tsoumakas, From lengthy to lucid: A systematic
literature review on nlp techniques for taming long
sentences, ArXiv abs/2312.05172 (2023). URL:
https://api.semanticscholar.org/CorpusID:266149795.</p>
        <p>[9] I. Fajardo, V. Ávila, A. Ferrer, G. Tavares, M. Gómez,
A. M. Hernández, Easy-to-read texts for students
with intellectual disability: linguistic factors
affecting comprehension., Journal of Applied Research
in Intellectual Disabilities: JARID 27 3 (2014) 212–25.
URL: https://api.semanticscholar.org/CorpusID:33895340.</p>
        <p>[10] E. Perego, F. D. Missier, M. Porta, M. M., The
cognitive effectiveness of subtitle processing, Media
Psychology 13 (2010) 243–272.
doi:10.1080/15213269.2010.502873.</p>
        <p>[11] A. Álvarez, H. Arzelus, T. Etchegoyhen, Towards
customized automatic segmentation of subtitles, in:
J. L. Navarro Mesa, A. Ortega, A. Teixeira,
E. Hernández Pérez, P. Quintana Morales,
A. Ravelo García, I. Guerra Moreno, D. T. Toledano (Eds.),
Advances in Speech and Language Technologies for
Iberian Languages, Springer International Publishing,
Cham, 2014, pp. 229–238.</p>
        <p>[12] A. Álvarez, C.-D. Martínez-Hinarejos, H. Arzelus,
M. Balenciaga, A. del Pozo, Improving the automatic
segmentation of subtitles through conditional
random field, Speech Communication 88 (2017) 83–95.
URL: https://www.sciencedirect.com/science/article/pii/S0167639316300127.
doi:https://doi.org/10.1016/j.specom.2017.01.010.</p>
        <p>[13] Due Parole, Due parole, s.d. URL: https://www.dueparole.it/.</p>
        <p>[14] Anfas, Documenti facili da leggere,
https://www.anfas.net/it/linguaggio-facile-da-leggere/documenti-facili-da-leggere/,
s.d.</p>
        <p>[15] M. E. Piemontese, Scrittura e leggibilità: «due
parole», in: M. A. Cortelazzo (Ed.), Scrivere nella
scuola dell’obbligo, Quaderni del Giscel, La Nuova
Italia, Firenze, 1991, pp. 151–167.</p>
        <p>[16] Inclusion Europe, Pathways2, s.d. URL:
https://www.inclusion-europe.eu/pathways-2/.</p>
        <p>[17] D. Steinberg, Cart: Classification and regression
trees, 2009. URL: https://api.semanticscholar.org/CorpusID:116184048,
technical report.</p>
        <p>[18] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel,
B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer,
R. Weiss, V. Dubourg, J. VanderPlas, A. Passos,
D. Cournapeau, M. Brucher, M. Perrot, É. Duchesnay,
Scikit-learn: Machine learning in Python, Journal of
Machine Learning Research 12 (2011) 2825–2830. URL:
https://scikit-learn.org/stable/modules/tree.html.</p>
        <p>Declaration on Generative AI: During the preparation
of this work, the author(s) used ChatGPT (OpenAI) in
order to: Paraphrase and reword. After using these
tool(s)/service(s), the author(s) reviewed and edited the
content as needed and take(s) full responsibility for the
publication’s content.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>