=Paper=
{{Paper
|id=Vol-3878/88_main_long
|storemode=property
|title=Is Sentence Splitting a Solved Task? Experiments to the Intersection between NLP and Italian Linguistics
|pdfUrl=https://ceur-ws.org/Vol-3878/88_main_long.pdf
|volume=Vol-3878
|authors=Arianna Redaelli,Rachele Sprugnoli
|dblpUrl=https://dblp.org/rec/conf/clic-it/RedaelliS24
}}
==Is Sentence Splitting a Solved Task? Experiments to the Intersection between NLP and Italian Linguistics==
Arianna Redaelli¹, Rachele Sprugnoli¹,*
¹ Università di Parma, Via D’Azeglio, 85, 43125 Parma, Italy
Abstract
Sentence splitting, that is the segmentation of the raw input text into sentences, is a fundamental step in text processing.
Although it is considered a solved task for texts such as news articles and Wikipedia pages, the performance of systems
can vary greatly depending on the text genre. This paper presents the evaluation of the performance of eight sentence
splitting tools adopting different approaches (rule-based, supervised, semi-supervised, and unsupervised learning) on Italian
19th-century novels, a genre that has not received sufficient attention so far but which can be an interesting common ground
between Natural Language Processing and Digital Humanities.
Keywords
sentence splitting, text segmentation, literary texts, Italian
CLiC-it 2024: Tenth Italian Conference on Computational Linguistics, Dec 04-06, 2024, Pisa, Italy
* Corresponding author.
† This paper is the result of the collaboration between the two authors. For the specific concerns of the Italian academic attribution system: Rachele Sprugnoli is responsible for Sections 2, 3, 6; Arianna Redaelli is responsible for Sections 1, 4, 8. Section 7 was collaboratively written by the two authors.
Email: arianna.redaelli@unipr.it (A. Redaelli); rachele.sprugnoli@unipr.it (R. Sprugnoli)
ORCID: 0000-0001-6374-9033 (A. Redaelli); 0000-0001-6861-5595 (R. Sprugnoli)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073
1. Introduction
Sentence splitting is the process of segmenting a text into sentences¹ by detecting their boundaries, which, at least for Western languages, including Italian, usually correspond to certain punctuation marks [2]. This means that sentence splitting, for many languages, is a matter of punctuation disambiguation, that is, recognizing whether or not a punctuation mark signals a sentence boundary. The importance of sentence splitting is often underestimated because it is considered an easy task, but its quality has a strong impact on the quality of subsequent text processing, because errors can propagate, reducing the performance of downstream tasks such as Syntactic Analysis [3], Machine Translation [4] and Automatic Summarization [5].
¹ By "sentence" we mean a coherent set of words constructed according to the general rules of the language, conveying a complete thought that makes sense on its own [1]. A sentence ends with a strong punctuation mark (e.g., full stop, question mark, or exclamation point) and is typically followed by a capital letter. The definition of sentence adopted here, which like any definition is inherently problematic, is motivated by the specific requirements of the present work, as will be seen below.
The most popular pipeline models, such as those of Stanza [6] and spaCy², have mostly been trained and evaluated on fairly formal texts, such as news articles and Wikipedia pages, so the publicly reported performances tend to be high, i.e. above 0.90 in terms of F1. However, the text genre has a significant impact on the results. For example, in the CoNLL 2018 shared task “Multilingual Parsing from Raw Text to Universal Dependencies”, the best system on the Italian ISDT treebank [7] achieved an F1 of 0.99, while on the PoSTWITA treebank, made of tweets [8], the highest result was 0.66.
² https://spacy.io
Given these variations, considering less formal text genres could provide valuable insights into the challenges of sentence splitting. Among these genres are literary texts, which present unique and peculiar stylistic and creative features that can break traditional grammatical norms, including punctuation ones [9]. These features depend on both authorial choices and the cultural context of the time. As a matter of fact, punctuation can vary significantly depending on the historical period; literary texts may follow prevailing trends or oppose them, giving rise to new trends. This phenomenon is particularly evident in the 19th century, when the Italian usus punctandi began shifting from a primarily syntactic usage, prescribed by grammar books, to a communicative-textual usage of punctuation marks [10]. Since this shift was probably influenced by the reflections and the practical uses of prominent authors such as Alessandro Manzoni [11], our study focuses on his historical novel, “I Promessi Sposi”. The author paid meticulous attention to the punctuation of the text, revising it up to the final print proofs, and made specific and personal choices in collaboration with the publisher, alongside more classical ones [12]. Although not always consistent, Manzoni’s decisions make the novel particularly complex and interesting from a punctuation perspective. Furthermore, “I Promessi Sposi”
has been a fundamental reference for the development of a common written Italian language: starting from this assumption, many of the author’s punctuation choices have been adopted by later grammars for rule-making, though only some of them have become part of the standard.
Given that punctuation was still undergoing standardization at the time, and that its use can depend not only on the conventions of the period but also on the writer’s personal style, the type of content being addressed (and how it is presented), and even the influence of typography during the printing process, we also decided to broaden our study to include sections from other novels contemporary to Manzoni’s (1840-42). Specifically, we analyzed "I Malavoglia" (1881) by Giovanni Verga, "Le avventure di Pinocchio. Storia di un burattino" (1883) by Carlo Collodi, and "Cuore" (1886) by Edmondo de Amicis.
In this paper, our main contributions are as follows: (i) we provide an estimate of the performance of eight sentence splitting tools adopting different approaches on a specific and challenging text genre, namely historical literary fiction texts, which has not received enough attention so far; (ii) we compare the results considering the point of view of humanities scholars (in particular Italian linguistics) as the main stakeholders in the considered domain, in order to establish a flourishing cross-fertilization between NLP and Digital Humanities; (iii) we release manually split data for four 19th-century Italian novels and a shared notebook in which many of the tested systems can be run.³
³ https://github.com/RacheleSprugnoli/Sentence_Splitting_Manzoni
2. Related Work
Sentence splitting systems can be categorized into three macro-classes based on the approach used to develop them. There are rule-based systems, such as Sentence Splitter⁴ and the Sentencizer module of spaCy, that use heuristics specific to the various languages and lists of exceptions and abbreviations. Then, there are supervised systems that need datasets in which sentences are already correctly segmented to be trained. For example, UDPipe [13] and Stanza are trained on Universal Dependencies (UD) treebanks [14]. Finally, unsupervised systems are trained on datasets of non-segmented texts, taking advantage of features such as the length of words and collocational information. An example is given by Punkt, available as a module within the NLTK (Natural Language Toolkit) library [15]. In our work, we test these various approaches on a benchmark dataset of historical literary fiction texts by evaluating the performance of eight different systems.
⁴ https://github.com/mediacloud/sentence-splitter
There are several studies that analyze the impact of text genre on sentence splitting, but literary texts are rarely considered. For example, Liu et al. [16] work on speech transcriptions, Sheik et al. [17] on legal texts, and Rudrapal et al. [18] on social media posts. Moreover, a shared task on sentence boundary detection in the financial domain (FinSBD) was organized in 2019, 2020 and 2021 [19].
Most of the available studies concern the processing of English texts, while Italian is usually not included in the evaluation. An interesting exception is given by a work on multilingual legal texts that contains a detailed evaluation of the results on Italian documents [20].
Our work draws inspiration from the assessment on English texts provided by Read et al. [21], which includes, among others, the Sherlock Holmes stories, but moves to the Italian context. Furthermore, we focus on the literary context, showing how 19th-century novels are a challenge for current sentence splitting systems.
3. Tools
Sentence splitting is a fundamental analysis in text processing, for which there are many tools available, also for Italian. For our evaluation we have selected eight tools developed with different approaches. Some tools are modules integrated in larger pipelines, others are systems specifically created to perform only sentence splitting. It is important to note that the selected tools do not split in the presence of a colon or semicolon. Indeed, although recent studies in the punctuation field identify colons and semicolons as punctuation marks capable of indicating the boundary of a sentence [22], as anticipated in footnote 1, in this work we have decided not to consider them as separating marks because of the various forms literary texts can take. To clarify the issue, we can consider the example of direct speech. In “I Promessi Sposi”, direct speech can be introduced by a verbum dicendi and a colon, continuing without any interruption. In such cases, splitting at the colon would be relatively easy. However, direct speech can also be embedded within a sentence that continues after the quotation closes, creating a non-autonomous text portion that, during sentence splitting, should be manually reconnected to the one preceding the quotation itself (e.g., Lucia sospirò, e ripeté: «coraggio,» con una voce che smentiva la parola. EN: Lucia sighed, and repeated, «courage,» in a voice that belied the word.). An equally troublesome problem arises when the diegetic frame follows the quotation instead of preceding it. When this happens, the colon is absent, and other punctuation marks like commas are found before the closing quotation marks or dash (e.g., «È il mio caso,» disse Renzo. EN: «That’s my case,» said Renzo.). The system would not split the sentences at these punctuation marks, yet the diegetic frame following the direct speech has the same value and autonomy as the one preceding it. Consequently, considering colons and semicolons as sentence boundaries would make the segmentation much more complex and often inaccurate.
The selected tools are the following:
• CoreNLP⁵: an NLP pipeline written in Java and developed by Stanford University [23]. It contains various modules including ssplit, which divides a text into sentences via a set of rules. The latest version of the pipeline (4.5.7) supports eight languages including Italian.
• spaCy: an open-source NLP library which supports dozens of languages, including Italian, and provides four alternatives for sentence splitting. Among these, statistical models for Italian have been trained to split on colons and semicolons. For this reason, we tested the performance only of Sentencizer, the rule-based pipeline component.
• Sentence Splitter⁶: a Python module based on scripts developed for processing the Europarl corpus [24]. It supports several languages with ad-hoc rules.
• UDPipe⁷: an NLP pipeline based on the UD framework performing tokenization, sentence splitting, PoS tagging, lemmatization and syntactic analysis. UDPipe 2 is written in Python and uses the tokenizer of UDPipe 1; among the 131 most recent models (version 2.12), seven are for Italian. We evaluated the model trained on the VIT treebank [25], which does not (always) split at colons and semicolons.
• Stanza⁸: an NLP package written in Python and based on neural network components. Sentence splitting is jointly performed with tokenization by the TokenizeProcessor module. The default Italian model is a combination of multiple UD treebanks.
• Ersatz⁹: a language-agnostic neural model based on a semi-supervised training paradigm. It combines the use of regular expressions to detect candidate sentence boundaries with a Transformer-based binary classifier [26].
• Punkt: an unsupervised system which uses collocational information to identify abbreviations, initials, and ordinal numbers. All punctuation not included in these elements is considered an end-of-sentence marker.
• WtP¹⁰: an unsupervised multilingual sentence segmentation system based on a self-supervised learning approach tested on 85 languages, including Italian. It does not rely on punctuation or sentence-segmented training data and is thus a punctuation-agnostic system [27]. Among the various available models, we adopted wtp-canine-s-12l which, according to the official documentation of the tool, has the best results on languages other than English.
⁵ https://stanfordnlp.github.io/CoreNLP/
⁶ https://github.com/mediacloud/sentence-splitter
⁷ https://ufal.mff.cuni.cz/udpipe
⁸ https://stanfordnlp.github.io/stanza/
⁹ https://github.com/rewicks/ersatz
¹⁰ https://github.com/segment-any-text/wtpsplit
For the evaluation, the tools were used as they are, with their default configurations and without any customization. For this reason, given the choices motivated above, we did not consider other systems, such as Tint [28], which by default split at colons and semicolons.
4. Dataset
The data used to evaluate the aforementioned tools are taken from “I Promessi Sposi” in its final version published in 1840-1842¹¹. 3,095 sentences, corresponding to 12 chapters of the novel, were manually split. This dataset was divided into training, development and test sets according to the proportions 80/10/10, following the UD rules, under which these proportions are calculated using syntactic words as units.¹² To obtain syntactic words and calculate this splitting, sentences were segmented and tokenized by hand; this gold standard was then processed with the combined Stanza model.¹³ Following this division, the test set is made of 324 sentences.
¹¹ The text, fully digitized and available online, was collated with the reference edition [29] prior to analysis, to ensure maximum fidelity to the author’s punctuation choices.
¹² https://universaldependencies.org/release_checklist.html#data-split
¹³ The output of this process was used to train a new Stanza model as reported in Section 6.
Table 1 shows the sentence-ending punctuation marks in the test set. Both the total number of occurrences (TOTAL) and the number of times a sign is an end-of-sentence marker (EOS) are reported. In addition to the full stop, sentence boundaries can be indicated by expressive punctuation marks (!, ?) when followed by a capital letter. If followed by a lowercase letter, instead, these marks only have an expressive role, modifying the sentence’s internal intonation without determining its end. Low quotation marks («») and long dashes (–), used for direct speech and thoughts respectively, typically determine a sentence boundary when they appear with another demarcative punctuation mark (e.g., a full stop). In Manzoni’s novel, if a closing quotation mark (guillemets or long dashes) appears with another punctuation mark, the latter is usually placed before the former,
which formally closes the sentence. Lastly, in the novel, suspension points (...) can indicate a sentence boundary when they suggest a suspensive allusion or when they mark the interruption of a character’s line due to linguistic or extra-linguistic contingencies. In such cases, the demarcative function of suspension points is shown either by the following capital letter or by an opening quotation mark which indicates the beginning of a different character’s line.
Table 1: End-of-sentence markers in the test set.
MARK    # TOTAL    # EOS
.       277        237
»       90         53
?       47         22
!       31         6
...     23         3
–       10         3
5. Results of the Evaluation
Table 2 reports the results of our evaluation in terms of F1. The best performance (0.94) is registered with Sentence Splitter, a rule-based system. None of the other tools exceeds 0.70, a significantly lower performance than those reported on contemporary Italian texts. For example, the official result of UDPipe 2 on the VIT treebank with the 2.12 model starting from raw text is 0.95, that is, almost 30 points more than what is obtained on our test set. The lowest result (0.51) is obtained by the unsupervised WtP system. Although the rule-based approach seems to be the most promising, only Sentence Splitter has an excellent result even without any adaptation of the existing rules.
Table 2: Results (in terms of F1) of eight systems developed with different approaches: rule-based (RB), supervised (S), semi-supervised (SS) and unsupervised learning (U).
TYPE    SYSTEM                   F1
RB      spaCy sentencizer        0.61
RB      CoreNLP 4.5.7 ssplit     0.66
RB      SentenceSplitter         0.94
S       UDPipe 2 VIT model       0.66
S       Stanza combined          0.69
SS      Ersatz                   0.60
U       Punkt                    0.68
U       WtP wtp-canine-s-12l     0.51
Analyzing the outputs of the various systems, it is possible to notice some recurring errors (a few examples are reported in Table 3):
1. Misinterpretation of guillemets («,»). The closing sign of the low quotation marks is not recognized as a sentence boundary, so in the automatic segmentation it can appear at the beginning or in the middle of a sentence.
2. In supervised systems semicolons and colons are sometimes considered as sentence boundary signals. Indeed, in the VIT treebank and in those used to train the combined Stanza model, sentences are segmented inconsistently: sometimes semicolons and colons are strong punctuation, and sometimes not.
3. Suspension points are always considered strong punctuation marks and the sentence is split after them.
4. A sentence is often split after an expressive punctuation mark (?, !) even if it is followed by a lowercase letter.
5. The long dash is not recognized as a sentence-ending marker; consequently, either the sentence continues after the dash or the dash appears at the beginning of the following sentence.
6. Training a New Stanza Model
With the rest of the manually split data, namely 2,447 sentences for the training set and 324 for the development set, a new Stanza model specific to Manzoni’s text was trained. Different numbers of sentences were used for training in order to assess the effect of the dataset size on the performance. The results obtained with 1500 steps are the following:
• 300 sentences: 0.97 F1
• 1000 sentences: 0.98 F1
• 2,447 sentences: 0.99 F1
With just 300 sentences there is already a clear improvement over the default model, and the result is even higher than the one obtained with Sentence Splitter, the system that had proven to be the best on our test set.
7. What About Other Novels?
Table 4 displays the performance of the systems tested on “I Promessi Sposi” when applied to approximately the first 90 sentences of three other important 19th-century novels:¹⁴ “I Malavoglia” (1881) by Giovanni Verga [30], “Le avventure di Pinocchio. Storia di un burattino” (1883) by Carlo Collodi [31], “Cuore” (1886) by Edmondo de Amicis [32].¹⁵
¹⁴ The reference edition text was used for the analysis of these novels too.
¹⁵ 86 sentences are taken from “I Malavoglia”, corresponding to the first chapter of the novel; 93 sentences, that is the first two chapters, come from “Le avventure di Pinocchio”; 87 sentences are taken from “Cuore”, corresponding to the first three chapters of the novel.
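The F1 values reported in Table 2 and in the retraining experiment compare automatically predicted sentence boundaries with the manual segmentation. The paper does not describe its scoring script, so the following is only a minimal sketch of one possible boundary-level F1. It assumes that a predicted boundary counts as correct when it falls at the same position as a gold boundary, with positions counted in non-whitespace characters so that differences in spacing do not matter. The function names `boundary_offsets` and `boundary_f1` are hypothetical, and the final end-of-text boundary is counted like any other. The sample data are the gold and Ersatz segmentations of the second example in Table 3.

```python
def boundary_offsets(sentences):
    """Return the end position of each sentence, counted in non-whitespace characters."""
    offsets = set()
    position = 0
    for sentence in sentences:
        position += sum(1 for ch in sentence if not ch.isspace())
        offsets.add(position)
    return offsets


def boundary_f1(gold_sentences, predicted_sentences):
    """Return (precision, recall, F1) over sentence-end positions.

    Assumption: a predicted boundary is correct only if it coincides
    exactly with a gold boundary. This is an illustrative metric, not
    the scoring procedure used in the paper.
    """
    gold = boundary_offsets(gold_sentences)
    predicted = boundary_offsets(predicted_sentences)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0, 0.0, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Gold and Ersatz segmentations of the second example in Table 3.
gold = ["– È lei, di certo!–", "Era proprio lei, con la buona vedova."]
ersatz = ["– È lei, di certo!", "– Era proprio lei, con la buona vedova."]

print(boundary_f1(gold, ersatz))  # (0.5, 0.5, 0.5)
```

In this example the misplaced closing dash moves the first boundary by one character, so only the end-of-text boundary matches. Excluding that trivial final boundary would be an equally defensible choice and would lower the score further.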
Table 3: Examples of errors in two of the tested systems compared with the manually split sentences.
Example 1
Test gold: 1) «Al sagrestano gli crede?» 2) «Perché?»
UDPipe 2 (VIT model): 1) » «Al sagrestano gli crede?» «Perché?»
Ersatz: 1) » «Al sagrestano gli crede? 2) » «Perché?
Example 2
Test gold: 1) – È lei, di certo!– 2) Era proprio lei, con la buona vedova.
UDPipe 2 (VIT model): 1) – È lei, di certo!– Era proprio lei, con la buona vedova.
Ersatz: 1) – È lei, di certo! 2) – Era proprio lei, con la buona vedova.
Example 3
Test gold: 1) Anche Agnese, veda; anche Agnese. . . » 2) «Uh! ha voglia di scherzare, lei,» disse questa.
UDPipe 2 (VIT model): 1) Anche Agnese, veda; anche Agnese. . . » «Uh! ha voglia di scherzare, lei,» disse questa.
Ersatz: 1) Anche Agnese, veda; anche Agnese. . . » «Uh! 2) ha voglia di scherzare, lei,» disse questa. «
Table 4: Results on about 90 sentences taken from other 19th-century novels. Stanza retr. refers to the model retrained on Manzoni’s novel, as described in Section 6.
SYSTEM              Malavoglia    Pinocchio    Cuore
spaCy               0.73          0.35         0.84
CoreNLP ssplit      0.76          0.72         0.62
Sentence Splitter   0.77          0.45         0.68
UDPipe              0.75          0.79         0.67
Stanza              0.71          0.70         0.61
Stanza retr.        0.90          0.89         0.69
Ersatz              0.72          0.75         0.66
Punkt               0.73          0.77         0.66
WtP                 0.53          0.78         0.39
The results obtained are once again lower than those reported for contemporary texts, but the model retrained on “I Promessi Sposi” shows improved performance for all novels, especially when applied to “I Malavoglia” and “Le avventure di Pinocchio” (+19 points with respect to the default Stanza combined model in both cases); the improvement is more limited for “Cuore” (+8 points).
The rule-based approach is promising but with different systems (spaCy for “Cuore” and ssplit for “I Malavoglia”). Instead, the VIT model of UDPipe, and therefore a supervised approach, is the best on “Le avventure di Pinocchio”. Some tools obtain extremely different results depending on the text they process. spaCy and Sentence Splitter record a very low result on “Le avventure di Pinocchio” (0.35 and 0.45 respectively), while WtP has an F1 of only 0.39 on “Cuore”, half of what it achieved on “Le avventure di Pinocchio”.
This diversified situation is principally due to the fact that each novel presents unique characteristics, even in punctuation.
“I Malavoglia” is a choral novel in which the various styles of speech of the characters and the narrative voice are mixed together. Punctuation marks largely represent this mixture. Indeed, among the main peculiarities of the novel is the original and personal use of quotation marks. For example, guillemets («,») are frequently used to refer to popular sayings and proverbs as well as to short formulas [33], which sometimes intersperse the diegesis, whether introduced by colons or not, and sometimes isolate a complete enunciative section. The long dash (–), instead, has a number of different functions [34]: one of these is to signal direct speech, but it often marks only its beginning and not its end. This leads, on one hand, to a variety of ways of handling parenthetical elements and, on the other hand, to a blurred boundary between the characters’ speech, the characters’ speech mediated by the narrator, and the narrator’s own discourse.
“Pinocchio”, a novel written for a young audience, is characterized by a strongly dialogic style [35]. For direct speech, including the simulated dialogue between the narrator and the reader, the long dash (–) is abundantly used, but as in "I Malavoglia", the opening dashes are not always accompanied by the closing ones. Additionally, Collodi frequently uses punctuation clusters, specifically the exclamation mark followed by suspension points (!...), at the end of sentences [36], a possibility mostly not contemplated by late 19th-century grammars.
Lastly, Edmondo de Amicis’s novel “Cuore” tells the story of a child’s school experience from his point of view, adopting a diary-like structure. In “Cuore”, the linguistic form is simple and plain: the sentences are mainly short and often end with a standard strong punctuation mark, followed by a capital letter. Direct speech is clearly indicated by long dashes (–), but successive lines of dialogue are arranged consecutively on the page, and in such cases, the closing dash of the previous line also serves as the opening dash of the next line. Since the lines of dialogue are perfectly integrated into the narrative structure, they can end with various punctuation marks, from commas to semicolons to full stops. When the punctuation mark is not strong, after the preliminary conclusion of the line, the text continues with the narrator’s discourse.
Beyond the specific differences listed schematically above, there are also some common typographical and punctuation features among the considered novels. For example, when a closing quotation mark appears with another punctuation mark, the latter in general occurs before the former, as found in “I Promessi Sposi”.
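The boundary conventions adopted for the gold annotation can be illustrated with a small rule-based sketch: a strong mark (full stop, question mark, exclamation mark, suspension points) is treated as a boundary only when the next text starts with a capital letter or an opening guillemet, a closing guillemet is expected after the mark, and colons and semicolons never split. This is not one of the evaluated tools and not the authors' procedure; the names `CANDIDATE` and `split_sentences` are hypothetical. By assumption, the sketch ignores long dashes, abbreviations, and suspension points typeset with spaces (". . ."), all of which a real system would have to handle.

```python
import re

# Candidate boundary: a strong mark, an optional closing guillemet,
# then whitespace. Colons and semicolons are deliberately excluded.
CANDIDATE = re.compile(r"(?:\.\.\.|[.?!])»?\s+")


def split_sentences(text):
    """Split text at strong marks followed by a capital letter or «.

    Illustrative only: a mark followed by a lowercase letter is treated
    as sentence-internal (expressive use), as described for ? and !.
    """
    sentences = []
    start = 0
    for match in CANDIDATE.finditer(text):
        next_char = text[match.end():match.end() + 1]
        if next_char == "«" or next_char.isupper():
            sentences.append(text[start:match.end()].strip())
            start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


# Closing guillemet after the question mark, next line opens with «.
print(split_sentences("«Al sagrestano gli crede?» «Perché?»"))
# Expressive "!" followed by a lowercase letter: no split.
print(split_sentences("«Uh! ha voglia di scherzare, lei,» disse questa."))
```

The first call returns the two gold sentences of the first example in Table 3, and the second call keeps the line of dialogue together with its diegetic frame. A sequence such as "!–" in the second example of Table 3 would not be split by this sketch, which is the kind of case that motivates retraining or novel-specific rules.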
8. Conclusions
This paper presents an assessment of the performance of eight sentence splitting tools adopting different approaches on four 19th-century novels: "I Promessi Sposi" by Alessandro Manzoni, "I Malavoglia" by Giovanni Verga, "Le avventure di Pinocchio" by Carlo Collodi, and "Cuore" by Edmondo de Amicis. Although these texts belong to the same historical period, they show specific features depending on the form and content of the novel as well as the author’s stylistic choices. Among these features is punctuation, which in the late 19th century had not yet reached a detectable stability and was rather experiencing a paradigmatic change.
Since sentence splitting for Western languages, including Italian, relies heavily on punctuation disambiguation, applying existing tools to the four novels considered has resulted in performances well below the standards. These texts demonstrate that sentence splitting is not a completely solved task.
On the other hand, applying the model retrained on “I Promessi Sposi” to the other three novels showed significant improvements for “Le avventure di Pinocchio” and “I Malavoglia”, and a moderate improvement for “Cuore”. This result suggests that shared historical context and belonging to the same textual genre may offer sufficient similarities to improve the model’s performance. However, the example of "Cuore" is evidence of how this is sometimes not enough: some specific features in form, punctuation and style continue to affect sentence splitting, demonstrating that although retraining may mitigate some problems, it does not completely overcome the inherent variability of these texts.
Philologists have increasingly focused on preserving the original punctuation as a part of the author’s creation of the text, providing valuable and reliable study resources for scholars of linguistics and the history of the Italian language. Their combined knowledge is precious for achieving accurate sentence splitting in these texts. Thus, sentence splitting can be an interesting common ground between different disciplines, potentially leading to the development of tools for the automatic analysis of historical literary texts. This field remains under-explored in the Italian context, offering significant opportunities for further study and cross-disciplinary collaboration.
Acknowledgments
This publication was produced by a researcher with a research contract co-funded by the European Union - PON Ricerca e Innovazione 2014-2020, pursuant to Art. 24, paragraph 3, letter a, of Law No. 240 of 30 December 2010, as subsequently amended, and Ministerial Decree No. 1062 of 10 August 2021.
References
[1] I. Bonomi, A. Masini, S. Morgana, M. Piotti, et al., Elementi di linguistica italiana, volume 103, Carocci, 2010.
[2] D. D. Palmer, Chapter 2: Tokenisation and sentence segmentation, Handbook of natural language processing (2007).
[3] R. Dridan, S. Oepen, Document parsing: Towards realistic syntactic analysis, in: Proceedings of The 13th International Conference on Parsing Technologies (IWPT 2013), 2013, pp. 127–133.
[4] R. Wicks, M. Post, Does sentence segmentation matter for machine translation?, in: Proceedings of the Seventh Conference on Machine Translation (WMT), 2022, pp. 843–854.
[5] Y. Liu, S. Xie, Impact of automatic sentence segmentation on meeting summarization, in: 2008 IEEE International Conference on Acoustics, Speech and Signal Processing, IEEE, 2008, pp. 5009–5012.
[6] P. Qi, Y. Zhang, Y. Zhang, J. Bolton, C. D. Manning, Stanza: A Python natural language processing toolkit for many human languages, in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, 2020. URL: https://nlp.stanford.edu/pubs/qi2020stanza.pdf.
[7] C. Bosco, S. Montemagni, M. Simi, et al., Converting Italian Treebanks: Towards an Italian Stanford Dependency Treebank, in: Proceedings of the 7th Linguistic Annotation Workshop and Interoperability with Discourse, The Association for Computational Linguistics, 2013, pp. 61–69.
[8] M. Sanguinetti, C. Bosco, A. Lavelli, A. Mazzei, O. Antonelli, F. Tamburini, PoSTWITA-UD: an Italian Twitter treebank in Universal Dependencies, in: N. Calzolari, K. Choukri, C. Cieri, T. Declerck, S. Goggi, K. Hasida, H. Isahara, B. Maegaard, J. Mariani, H. Mazo, A. Moreno, J. Odijk, S. Piperidis, T. Tokunaga (Eds.), Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), European Language Resources Association (ELRA), Miyazaki, Japan, 2018. URL: https://aclanthology.org/L18-1279.
[9] E. Tonani, Premessa. Tra punteggiatura e tipografia, in: E. Tonani (Ed.), Il romanzo in bianco e nero. Ricerche sull’uso degli spazi bianchi e dell’interpunzione nella narrativa italiana dall’Ottocento a oggi, Franco Cesati, Firenze, 2010, pp. 13–28.
[10] A. Ferrari, Punteggiatura, in: G. Antonelli, M. Motolese, L. Tomasi (Eds.), Storia dell’italiano scritto. Grammatiche, volume IV, Carocci, Roma, 2018, pp. 169–202.
[11] B. Mortara Garavelli, Prontuario di punteggiatura, Laterza, Bari, 2003.
[12] A. Manzoni, F. Ghisalberti, A. Chiari, L’ultima revisione dei Promessi Sposi, in: Tutte le opere di Alessandro Manzoni. I Promessi Sposi, volume II, Mondadori, Milano, 1954, pp. 789–989.
[13] M. Straka, UDPipe 2.0 prototype at CoNLL 2018 UD shared task, in: D. Zeman, J. Hajič (Eds.), Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, Association for Computational Linguistics, Brussels, Belgium, 2018, pp. 197–207. URL: https://aclanthology.org/K18-2020. doi:10.18653/v1/K18-2020.
[14] M.-C. De Marneffe, C. D. Manning, J. Nivre, D. Zeman, Universal Dependencies, Computational linguistics 47 (2021) 255–308.
[15] T. Kiss, J. Strunk, Unsupervised multilingual sentence boundary detection, Computational Linguistics 32 (2006) 485–525. URL: https://aclanthology.org/J06-4003. doi:10.1162/coli.2006.32.4.485.
[16] Y. Liu, A. Stolcke, E. Shriberg, M. Harper, Using conditional random fields for sentence boundary detection in speech, in: Proceedings of the 43rd annual meeting of the Association for Computational Linguistics (ACL’05), 2005, pp. 451–458.
[17] R. Sheik, T. Gokul, S. Nirmala, Efficient deep learning-based sentence boundary detection in legal text, in: Proceedings of the Natural Legal Language Processing Workshop 2022, 2022, pp. 208–217.
[18] D. Rudrapal, A. Jamatia, K. Chakma, A. Das, B. Gambäck, Sentence boundary detection for social media text, in: Proceedings of the 12th International Conference on Natural Language Processing, 2015, pp. 254–260.
[19] A. A. Azzi, H. Bouamor, S. Ferradans, The FinSBD-2019 shared task: Sentence boundary detection in PDF noisy text in the financial domain, in: C.-C. Chen, H.-H. Huang, H. Takamura, H.-H. Chen (Eds.), Proceedings of the First Workshop on Financial Technology and Natural Language Processing, Macao, China, 2019, pp. 74–80. URL: https://aclanthology.org/W19-5512.
[20] T. Brugger, M. Stürmer, J. Niklaus, MultiLegalSBD: a multilingual legal sentence boundary detection dataset, in: Proceedings of the Nineteenth International Conference on Artificial Intelligence and Law, 2023, pp. 42–51.
[21] J. Read, R. Dridan, S. Oepen, L. J. Solberg, Sentence boundary detection: A long solved problem?, in: M. Kay, C. Boitet (Eds.), Proceedings of COLING 2012: Posters, The COLING 2012 Organizing Committee, Mumbai, India, 2012, pp. 985–994. URL: https://aclanthology.org/C12-2096.
[22] A. Ferrari, L. Lala, F. Longo, F. Pecorari, B. Rosi, R. Stojmenova, La punteggiatura italiana contemporanea. Un’analisi comunicativo-testuale, Carocci, Roma, 2018.
[23] C. D. Manning, M. Surdeanu, J. Bauer, J. R. Finkel, S. Bethard, D. McClosky, The Stanford CoreNLP natural language processing toolkit, in: Proceedings of 52nd annual meeting of the association for computational linguistics: system demonstrations, 2014, pp. 55–60.
[24] P. Koehn, Europarl: A parallel corpus for statistical machine translation, in: Proceedings of Machine Translation Summit X: Papers, Phuket, Thailand, 2005, pp. 79–86. URL: https://aclanthology.org/2005.mtsummit-papers.11.
[25] R. Delmonte, A. Bristot, S. Tonelli, VIT-Venice Italian Treebank: Syntactic and quantitative features, in: Sixth International Workshop on Treebanks and Linguistic Theories, volume 1, Northern European Association for Language Technol, 2007, pp. 43–54.
[26] R. Wicks, M. Post, A unified approach to sentence segmentation of punctuated text in many languages, in: C. Zong, F. Xia, W. Li, R. Navigli (Eds.), Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Association for Computational Linguistics, Online, 2021, pp. 3995–4007. URL: https://aclanthology.org/2021.acl-long.309. doi:10.18653/v1/2021.acl-long.309.
[27] B. Minixhofer, J. Pfeiffer, I. Vulić, Where’s the point? self-supervised multilingual punctuation-agnostic sentence segmentation, in: A. Rogers, J. Boyd-Graber, N. Okazaki (Eds.), Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Toronto, Canada, 2023, pp. 7215–7235. URL: https://aclanthology.org/2023.acl-long.398. doi:10.18653/v1/2023.acl-long.398.
[28] A. Palmero Aprosio, G. Moretti, Tint 2.0: an all-inclusive suite for NLP in Italian, in: Proceedings of the Fifth Italian Conference on Computational Linguistics (CLiC-it 2018), Accademia University Press, 2018, pp. 311–317.
[29] A. Manzoni, B. Colli, I Promessi Sposi. Edizione genetica della Quarantana, Casa del Manzoni, Milano, 2024.
[30] G. Verga, F. Cecco, I Malavoglia, Fondazione Verga-Interlinea, Catania-Novara, 2014.
[31] C. Collodi, O. Castellani Pollidori, Le avventure di Pinocchio, Fondazione nazionale Carlo Collodi, Pescia, 1983.
[32] E. De Amicis, L. Tamburini, Cuore. Libro per ragazzi, Einaudi, Torino, 2018 (1° ed. 1972).
[33] G. B. Bronzini, Proverbi, discorso e gesto proverbiale nei «Malavoglia», in: I Malavoglia. Atti del Congresso Internazionale di Studi (26-28 novembre 1981), Biblioteca della Fondazione Verga, Catania, 1982, pp. 637–684.
[34] E. Tonani, Il ‘bianco di dialogato’ e il trattamento tipografico del discorso diretto, in: E. Tonani (Ed.), Il romanzo in bianco e nero. Ricerche sull’uso degli spazi bianchi e dell’interpunzione nella narrativa italiana dall’Ottocento a oggi, Franco Cesati, Firenze, 2010, pp. 103–136.
[35] R. Pellerey, Pinocchio tra dialogo e scrittura, Belfagor 60 (2005) 267–284. URL: https://www.jstor.org/stable/26150287.
[36] O. Castellani Pollidori, Introduzione, in: C. Collodi, O. Castellani Pollidori (Eds.), Le avventure di Pinocchio, Fondazione nazionale Carlo Collodi, Pescia, 1983, pp. XIII–LXXXIV.