<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Is Sentence Splitting a Solved Task? Experiments to the Intersection Between NLP and Italian Linguistics</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Arianna Redaelli</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Rachele Sprugnoli</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Università di Parma</institution>
          ,
          <addr-line>Via D'Azeglio, 85, 43125 Parma</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Sentence splitting, that is, the segmentation of the raw input text into sentences, is a fundamental step in text processing. Although it is considered a solved task for texts such as news articles and Wikipedia pages, the performance of systems can vary greatly depending on the text genre. This paper presents the evaluation of the performance of eight sentence splitting tools adopting different approaches (rule-based, supervised, semi-supervised, and unsupervised learning) on Italian 19th-century novels, a genre that has not received sufficient attention so far but which can be an interesting common ground between Natural Language Processing and Digital Humanities.</p>
      </abstract>
      <kwd-group>
        <kwd>sentence splitting</kwd>
        <kwd>text segmentation</kwd>
        <kwd>literary texts</kwd>
        <kwd>Italian</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Sentence splitting is the process of segmenting a text into sentences<sup>1</sup> by detecting their boundaries, which, at least for Western languages, including Italian, usually correspond to certain punctuation marks [<xref ref-type="bibr" rid="ref2">2</xref>]. This means that sentence splitting, for many languages, is a matter of punctuation disambiguation, that is, recognizing whether or not a punctuation mark signals a sentence boundary. The importance of sentence splitting is often underestimated because it is considered an easy task, but its quality has a strong impact on the quality of subsequent text processing, because errors can propagate and reduce the performance of downstream tasks such as Syntactic Analysis [<xref ref-type="bibr" rid="ref3">3</xref>], Machine Translation [<xref ref-type="bibr" rid="ref4">4</xref>] and Automatic Summarization [<xref ref-type="bibr" rid="ref5">5</xref>].</p>
      <p>The most popular pipeline models, such as those of Stanza [<xref ref-type="bibr" rid="ref6">6</xref>] and spaCy<sup>2</sup>, have mostly been trained and evaluated on fairly formal texts, such as news articles and Wikipedia pages, so the publicly reported performances tend to be high, i.e. above 0.90 in terms of F1. However, the text genre has a significant impact on the results. For example, in the CoNLL 2018 shared task “Multilingual Parsing from Raw Text to Universal Dependencies”, the best system on the Italian ISDT treebank [<xref ref-type="bibr" rid="ref7">7</xref>] achieved an F1 of 0.99, while on the PoSTWITA treebank, made of tweets [<xref ref-type="bibr" rid="ref8">8</xref>], the highest result was 0.66.</p>
      <p>Given these variations, considering less formal text genres could provide valuable insights into the challenges of sentence splitting. Among these genres are literary texts, which present unique and peculiar stylistic and creative features that can break traditional grammatical norms, including punctuation ones [<xref ref-type="bibr" rid="ref9">9</xref>]. These features depend on both authorial choices and the cultural context of the time. As a matter of fact, punctuation can vary significantly depending on the historical period; literary texts may follow prevailing trends or oppose them, giving rise to new trends. This phenomenon is particularly evident in the 19th century, when the Italian usus punctandi began shifting from a primarily syntactic usage, prescribed by grammar books, to a communicative-textual usage of punctuation marks [<xref ref-type="bibr" rid="ref10">10</xref>]. Since this shift was probably influenced by the reflections and the practical uses of prominent authors such as Alessandro Manzoni [<xref ref-type="bibr" rid="ref11">11</xref>], our study focuses on his historical novel, “I Promessi Sposi”. The author paid meticulous attention to the punctuation of the text, revising it up to the final print proofs, and made specific and personal choices in collaboration with the publisher, alongside more classical ones [12]. Although not always consistent, Manzoni’s decisions make the novel particularly complex and interesting from a punctuation perspective. Furthermore, “I Promessi Sposi” has been a fundamental reference for the development of a common written Italian language: starting from this assumption, many of the author’s punctuation choices have been adopted by later grammars for rule-making, though only some of them have become part of the standard.</p>
      <p>Given that punctuation was still undergoing standardization at the time, and that its use can depend not only on the conventions of the period but also on the writer’s personal style, the type of content being addressed (and how it is presented), and even the influence of typography during the printing process, we also decided to broaden our study to include sections from other novels contemporary to Manzoni’s (1840-42). Specifically, we analyzed "I Malavoglia" (1881) by Giovanni Verga, "Le avventure di Pinocchio. Storia di un burattino" (1883) by Carlo Collodi, and "Cuore" (1886) by Edmondo de Amicis.</p>
      <p>In this paper, our main contributions are as follows: (i) we provide an estimate of the performance of eight sentence splitting tools adopting different approaches on a specific and challenging text genre, namely historical literary fiction texts, which has not received enough attention so far; (ii) we compare the results considering the point of view of humanities scholars (in particular Italian linguistics) as the main stakeholders in the considered domain, in order to establish a flourishing cross-fertilization between NLP and Digital Humanities; (iii) we release manually split data for four 19th-century Italian novels and a shared notebook for running many of the tested systems.<sup>3</sup></p>
      <p>CLiC-it 2024: Tenth Italian Conference on Computational Linguistics, Dec 04–06, 2024, Pisa, Italy</p>
      <p>* Corresponding author.</p>
      <p>† This paper is the result of the collaboration between the two authors. For the specific concerns of the Italian academic attribution system: Rachele Sprugnoli is responsible for Sections 2, 3, 6; Arianna Redaelli is responsible for Sections 1, 4, 8. Section 7 was collaboratively written by the two authors.</p>
      <p>Email: arianna.redaelli@unipr.it (A. Redaelli); rachele.sprugnoli@unipr.it (R. Sprugnoli)</p>
      <p>ORCID: 0000-0001-6374-9033 (A. Redaelli); 0000-0001-6861-5595 (R. Sprugnoli)</p>
      <p>© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).</p>
      <p><sup>1</sup> By "sentence" we mean a coherent set of words constructed according to the general rules of the language, conveying a complete thought that makes sense on its own [<xref ref-type="bibr" rid="ref1">1</xref>]. A sentence ends with a strong punctuation mark (e.g., full stop, question mark, or exclamation point) and is typically followed by a capital letter. The definition of sentence adopted here, which like any definition is inherently problematic, is motivated by the specific requirements of the present work, as will be seen below.</p>
      <p><sup>2</sup> https://spacy.io</p>
      <p><sup>3</sup> https://github.com/RacheleSprugnoli/Sentence_Splitting_Manzoni</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <sec id="sec-2-1">
        <p>Sentence splitting systems can be categorized into three macro-classes based on the approach used to develop them. There are rule-based systems, such as Sentence Splitter<sup>4</sup> and the Sentencizer module of spaCy, that use heuristics specific to the various languages and lists of exceptions and abbreviations. Then, there are supervised systems, which need to be trained on datasets in which sentences are already correctly segmented. For example, UDPipe [13] and Stanza are trained on Universal Dependencies (UD) treebanks [14]. Finally, unsupervised systems are trained on datasets of non-segmented texts, taking advantage of features such as the length of words and collocational information. An example is given by Punkt, available as a module within the NLTK (Natural Language Toolkit) library [15]. In our work, we test these various approaches on a benchmark dataset of historical literary fiction texts by evaluating the performance of eight different systems.</p>
        <p>There are several studies that analyze the impact of text genre on sentence splitting, but literary texts are rarely considered. For example, Liu et al. [16] work on speech transcriptions, Sheik et al. [17] on legal texts, and Rudrapal et al. [18] on social media posts. Moreover, a shared task on sentence boundary detection in the financial domain (FinSBD) was organized in 2019, 2020 and 2021 [19].</p>
        <p>Most of the available studies concern the processing of English texts, while Italian is usually not included in the evaluation. An interesting exception is given by a work on multilingual legal texts that contains a detailed evaluation of the results on Italian documents [20].</p>
        <p>Our work draws inspiration from the assessment on English texts provided by Read et al. [21], which includes, among others, the Sherlock Holmes stories, but moves to the Italian context. Furthermore, we focus on the literary context, showing how 19th-century novels are a challenge for current sentence splitting systems.</p>
        <p><sup>4</sup> https://github.com/mediacloud/sentence-splitter</p>
      </sec>
      <sec id="sec-2-2">
        <title>3. Tools</title>
        <p>Sentence splitting is a fundamental analysis in text processing, for which there are many tools available, also for Italian. For our evaluation we have selected eight tools developed with different approaches. Some tools are modules integrated in larger pipelines, others are systems specifically created to perform only sentence splitting. It is important to note that the selected tools do not split in the presence of a colon or semicolon. Indeed, although recent studies in the punctuation field identify colons and semicolons as punctuation marks capable of indicating the boundary of a sentence [22], as anticipated in footnote 1, in this work we have decided not to consider them as separating marks because of the various forms literary texts can take. To clarify the issue, we can consider the example of direct speech. In “I Promessi Sposi”, direct speech can be introduced by a verbum dicendi and a colon, continuing without any interruption. In such cases, splitting at the colon would be relatively easy. However, direct speech can also be embedded within a sentence that continues after the quotation closes, creating a non-autonomous text portion that, during sentence splitting, should be manually reconnected to the one preceding the quotation itself (e.g., Lucia sospirò, e ripeté: «coraggio,» con una voce che smentiva la parola. EN: Lucia sighed, and repeated, «courage,» in a voice that belied the word.). An equally troublesome problem arises when the diegetic frame follows the quotation instead of preceding it. When this happens, the colon is absent, and other punctuation marks like commas are found before the closing quotation marks or dash (e.g., «È il mio caso,» disse Renzo. EN: «That’s my case,» said Renzo.). The system would not split the sentences at these punctuation marks, yet the diegetic frame following the direct speech has the same value and autonomy as the one preceding it. Consequently, considering colons and semicolons as sentence boundaries would make the segmentation much more complex and often inaccurate.</p>
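        <p>The effect of the colon on the two examples above can be illustrated with a minimal sketch. It is not one of the evaluated tools: the function split_sentences, its marks parameter and the boundary rule (a strong mark, optionally followed by a closing guillemet, then whitespace and an uppercase letter or an opening guillemet) are assumptions made only for illustration.</p>
        <preformat>
```python
import re

# The two examples quoted above from "I Promessi Sposi", joined into one text.
EXAMPLE = (
    "Lucia sospirò, e ripeté: «coraggio,» con una voce che smentiva la parola. "
    "«È il mio caso,» disse Renzo."
)


def split_sentences(text, marks=".!?"):
    """Hypothetical rule-based splitter (illustration only).

    A boundary is placed after one or more of the given marks, optionally
    followed by a closing guillemet, when the next non-space character is
    an uppercase letter or an opening guillemet.
    """
    boundary = re.compile("[" + re.escape(marks) + "]+»?\\s+(?=[A-ZÀ-Ö«])")
    sentences, start = [], 0
    for match in boundary.finditer(text):
        sentences.append(text[start:match.end()].strip())
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


# Colons and semicolons excluded: each example stays in one piece.
without_colon = split_sentences(EXAMPLE)

# Colons and semicolons included: the first example is cut after "ripeté:",
# leaving a non-autonomous fragment that starts with the quotation.
with_colon = split_sentences(EXAMPLE, marks=".!?:;")

for sentence in with_colon:
    print(sentence)
```
        </preformat>
        <p>With the default marks the sketch returns two sentences; adding the colon and semicolon produces three segments, the second of which is the kind of fragment that would have to be reconnected by hand.</p>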
        <p>The selected tools are the following:</p>
        <p>• CoreNLP<sup>5</sup>: an NLP pipeline written in Java and developed by Stanford University [23]. It contains various modules including ssplit, which divides a text into sentences via a set of rules. The latest version of the pipeline (4.5.7) supports eight languages including Italian.</p>
        <p>• spaCy: an open-source NLP library which supports dozens of languages, including Italian, and provides four alternatives for sentence splitting. Among these, statistical models for Italian have been trained to split on colons and semicolons. For this reason, we tested the performance only of Sentencizer, the rule-based pipeline component.</p>
        <p>• Sentence Splitter<sup>6</sup>: a Python module based on scripts developed for processing the Europarl corpus [24]. It supports several languages with ad-hoc rules.</p>
        <p>• UDPipe<sup>7</sup>: an NLP pipeline based on the UD framework performing tokenization, sentence splitting, PoS tagging, lemmatization and syntactic analysis. UDPipe 2 is written in Python and uses the tokenizer of UDPipe 1; among the 131 most recent models (version 2.12), seven are for Italian. We evaluated the model trained on the VIT treebank [25], which does not (always) split at colons and semicolons.</p>
        <p>• Stanza<sup>8</sup>: an NLP package written in Python and based on neural network components. Sentence splitting is jointly performed with tokenization by the TokenizeProcessor module. The default Italian model is trained on a combination of multiple UD treebanks.</p>
        <p>• Ersatz<sup>9</sup>: a language-agnostic neural model based on a semi-supervised training paradigm. It combines the use of regular expressions to detect candidate sentence boundaries with a Transformer-based binary classifier [26].</p>
        <p>• Punkt: an unsupervised system which uses collocational information to identify abbreviations, initials, and ordinal numbers. All punctuation not included in these elements is considered an end-of-sentence marker.</p>
        <p>• WtP<sup>10</sup>: an unsupervised multilingual sentence segmentation system based on a self-supervised learning approach tested on 85 languages, including Italian. It does not rely on punctuation or sentence-segmented training data, thus it is a punctuation-agnostic system [27]. Among the various available models, we adopted wtp-canine-s-12l which, according to the official documentation of the tool, has the best results on languages other than English.</p>
        <p>For the evaluation, the tools were used as they are, with their default configurations and without any customization. For this reason, given the choices motivated above, we did not consider other systems, such as Tint [28], which by default split at colons and semicolons.</p>
        <p><sup>5</sup> https://stanfordnlp.github.io/CoreNLP/</p>
        <p><sup>6</sup> https://github.com/mediacloud/sentence-splitter</p>
        <p><sup>7</sup> https://ufal.mf.cuni.cz/udpipe</p>
        <p><sup>8</sup> https://stanfordnlp.github.io/stanza/</p>
        <p><sup>9</sup> https://github.com/rewicks/ersatz</p>
        <p><sup>10</sup> https://github.com/segment-any-text/wtpsplit</p>
      </sec>
    </sec>
    <sec>
      <title>4. Dataset</title>
      <p>The data used to evaluate the aforementioned tools are taken from “I Promessi Sposi” in its final version published in 1840-1842.<sup>11</sup> A total of 3,095 sentences, corresponding to 12 chapters of the novel, were manually split. This dataset was divided into training, development and test sets according to the proportions 80/10/10, following the UD rules, according to which these proportions are calculated using syntactic words as units.<sup>12</sup> To obtain syntactic words and calculate this splitting, sentences were segmented and tokenized by hand; this gold standard was then processed with the combined Stanza model.<sup>13</sup> Following this division, the test set is made of 324 sentences.</p>
      <p>Table 1 shows the sentence-ending punctuation marks in the test set. Both the total number of occurrences (TOTAL) and the number of times a sign is an end-of-sentence marker (EOS) are reported. In addition to the full stop, sentence boundaries can be indicated by expressive punctuation marks (!, ?) when followed by a capital letter. If followed by a lowercase letter, instead, these marks only have an expressive role, modifying the sentence’s internal intonation without determining its end. Low quotation marks («») and long dashes (–), used for direct speech and thoughts respectively, typically determine a sentence boundary when they appear with another demarcative punctuation mark (e.g., a full stop). In Manzoni’s novel, if a closing quotation mark (guillemets or long dashes) appears with another punctuation mark, the latter is usually placed before the former, which formally closes the sentence. Lastly, in the novel, suspension points (...) can indicate a sentence boundary when they suggest a suspensive allusion or when they mark the interruption of a character’s line due to linguistic or extra-linguistic contingencies. In such cases, the demarcative function of suspension points is shown either by the following capital letter or by an opening quotation mark which indicates the beginning of a different character’s line.</p>
      <p><sup>11</sup> The text, fully digitized and available online, was collated with the reference edition [29] prior to analysis, to ensure maximum fidelity to the author’s punctuation choices.</p>
      <p><sup>12</sup> https://universaldependencies.org/release_checklist.html#data-split</p>
      <p><sup>13</sup> The output of this process was used to train a new Stanza model as reported in Section 6.</p>
    </sec>
    <sec>
      <title>5. Results of the Evaluation</title>
      <p>Table 2 reports the results of our evaluation in terms of F1. The best performance (0.94) is registered with Sentence Splitter, a rule-based system. None of the other tools exceeds 0.70, thus having significantly lower performances than those reported on contemporary Italian texts. For example, the official result of UDPipe 2 on the VIT treebank with the 2.12 model starting from a raw text is 0.95, that is almost 30 points more than what is obtained on our test set. The lowest result (0.51) is obtained by the unsupervised WtP system. Although the rule-based approach seems to be the most promising, only Sentence Splitter has an excellent result even without any adaptation of the existing rules.</p>
      <p>1. Misinterpretation of guillemets («,»). The closing sign of the low quotation marks is not recognized as a sentence boundary, so in the automatic segmentation it can appear at the beginning or in the middle of a sentence.</p>
      <p>2. In supervised systems semicolons and colons are sometimes considered as sentence boundary signals. Indeed, in the VIT treebank and in those used to train the combined Stanza model, sentences are segmented inconsistently: sometimes semicolons and colons are strong punctuation, and sometimes not.</p>
      <p>3. Suspension points are always considered strong punctuation marks and the sentence is split after them.</p>
      <p>4. A sentence is often split after an expressive punctuation mark (?, !) even if it is followed by a lowercase letter.</p>
      <p>5. The long dash is not recognized as a sentence-ending marker; consequently, either the sentence continues after the dash or the dash appears at the beginning of the following sentence.</p>
    </sec>
    <sec id="sec-3">
      <title>6. Training a New Stanza Model</title>
      <p>With the rest of the manually split data, namely 2,447 sentences for the training set and 324 for the development set, a new Stanza model specific for Manzoni’s text was trained. Different amounts of sentences were used for training in order to control the effect of the dataset size on the performance. The results obtained with 1500 steps are the following:</p>
      <p>• 300 sentences: 0.97 F1</p>
      <p>• 1000 sentences: 0.98 F1</p>
      <p>• 2,447 sentences: 0.99 F1</p>
      <p>With just 300 sentences there is already a clear improvement over the default model, obtaining an even higher result than the one obtained with Sentence Splitter, the system that had proven to be the best on our test set.</p>
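      <p>The F1 values reported for the default and retrained models compare an automatic segmentation with the manual one. The paper does not spell out the matching criterion, so the following is only a sketch under one common assumption: a predicted sentence counts as correct when its span, measured over non-whitespace characters, coincides exactly with a gold span. The names sentence_spans and f1_score and the toy data are hypothetical.</p>
      <preformat>
```python
def sentence_spans(sentences):
    """Map each sentence to (start, end) offsets counted over
    non-whitespace characters, so spacing differences do not matter."""
    spans, offset = set(), 0
    for sentence in sentences:
        length = len("".join(sentence.split()))
        spans.add((offset, offset + length))
        offset += length
    return spans


def f1_score(gold, predicted):
    """Sentence-level F1 under the exact-span-match assumption."""
    gold_spans = sentence_spans(gold)
    predicted_spans = sentence_spans(predicted)
    correct = len(gold_spans.intersection(predicted_spans))
    if correct == 0:
        return 0.0
    precision = correct / len(predicted_spans)
    recall = correct / len(gold_spans)
    return 2 * precision * recall / (precision + recall)


# Toy data: the system misses the boundary after "C d!".
gold = ["A b.", "C d!", "E f."]
predicted = ["A b.", "C d! E f."]

score = f1_score(gold, predicted)
print(round(score, 2))
```
      </preformat>
      <p>In the toy example one of two predicted sentences and one of three gold sentences match, giving a precision of 1/2, a recall of 1/3 and an F1 of 0.4; a single missed boundary therefore penalizes both neighbouring sentences.</p>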
    </sec>
    <sec id="sec-4">
      <title>7. What About Other Novels?</title>
      <table-wrap>
        <label>Table 4</label>
        <caption>
          <p>Results on about 90 sentences taken from other 19th-century novels. Stanza retr. refers to the model retrained on Manzoni’s novel, as described in Section 6.</p>
        </caption>
        <table>
          <thead>
            <tr><th>Tool</th><th>Malavoglia</th><th>Pinocchio</th><th>Cuore</th></tr>
          </thead>
          <tbody>
            <tr><td>spaCy</td><td>0.73</td><td>0.35</td><td>0.84</td></tr>
            <tr><td>CoreNLP ssplit</td><td>0.76</td><td>0.72</td><td>0.62</td></tr>
            <tr><td>Sentence Splitter</td><td>0.77</td><td>0.45</td><td>0.68</td></tr>
            <tr><td>UDPipe</td><td>0.75</td><td>0.79</td><td>0.67</td></tr>
            <tr><td>Stanza</td><td>0.71</td><td>0.70</td><td>0.61</td></tr>
            <tr><td>Stanza retr.</td><td>0.90</td><td>0.89</td><td>0.69</td></tr>
            <tr><td>Ersatz</td><td>0.72</td><td>0.75</td><td>0.66</td></tr>
            <tr><td>Punkt</td><td>0.73</td><td>0.77</td><td>0.66</td></tr>
            <tr><td>WtP</td><td>0.53</td><td>0.78</td><td>0.39</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <p>The results obtained are once again lower than those reported for contemporary texts, but the model retrained on “I Promessi Sposi” shows improved performance for all novels, especially when applied to “I Malavoglia” and to “Le avventure di Pinocchio” (+19 points with respect to the default Stanza combined model in both cases); the improvement is more limited for “Cuore” (+8 points).</p>
      <p>The rule-based approach is promising but with different systems (spaCy for “Cuore” and ssplit for “I Malavoglia”). Instead, the VIT model of UDPipe, and therefore a supervised approach, is the best on “Le avventure di Pinocchio”. Some tools obtain extremely different results depending on the text they process. spaCy and Sentence Splitter record a very low result on “Le avventure di Pinocchio” (0.35 and 0.45 respectively) while WtP has an F1 of only 0.39 on “Cuore”, half of what it achieved on “Le avventure di Pinocchio”.</p>
      <p>This diversified situation is principally due to the fact that each novel presents unique characteristics, even in punctuation.</p>
      <p>“I Malavoglia” is a choral novel in which the various styles of speech of the characters and the narrative voice are mixed together. Punctuation marks largely represent this mixture. Indeed, among the main peculiarities of the novel is the original and personal use of quotation marks. For example, guillemets («,») are frequently used to refer to popular sayings and proverbs as well as to short formulas [33], which sometimes intersperse the diegesis, whether introduced by colons or not, and sometimes isolate a complete enunciative section. The long dash (–), instead, has a number of different functions [34]: one of these is to signal direct speech, but often marking only its beginning and not its end. This leads, on the one hand, to a variety of ways of handling parenthetical elements and, on the other hand, to a blurred boundary between the characters’ speech, the characters’ speech mediated by the narrator, and the narrator’s own discourse.</p>
      <p>“Pinocchio”, a novel written for a young audience, is characterized by a strongly dialogic style [35]. For direct speech, including the simulated dialogue between the narrator and the reader, the long dash (–) is abundantly used, but as in "I Malavoglia", the opening dashes are not always accompanied by the closing ones. Additionally, Collodi frequently uses punctuation clusters, specifically the exclamation mark followed by suspension points (!...), at the end of sentences [36], a possibility mostly not contemplated by late 19th-century grammars.</p>
      <p>Lastly, Edmondo de Amicis’s novel “Cuore” tells the story of a child’s school experience from his point of view, adopting a diary-like structure. In “Cuore”, the linguistic form is simple and plain: the sentences are mainly short and often end with a standard strong punctuation mark, followed by a capital letter. Direct speech is clearly indicated by long dashes (–), but successive lines of dialogue are arranged consecutively on the page, and in such cases, the closing dash of the previous line also serves as the opening dash of the next line. Since the lines of dialogue are perfectly integrated into the narrative structure, they can end with various punctuation marks, from commas to semicolons to full stops. When the punctuation mark is not strong, after the preliminary conclusion of the line, the text continues with the narrator’s discourse.</p>
      <p>Beyond the specific differences listed schematically above, there are also some common typographical and punctuation features among the considered novels. For example, when a closing quotation mark appears with another punctuation mark, the latter in general occurs before the former, as found in “I Promessi Sposi”.</p>
      <p><sup>14</sup> The reference edition text was used for the analysis of these novels too.</p>
      <p><sup>15</sup> 86 sentences are taken from “I Malavoglia”, corresponding to the first chapter of the novel; 93 sentences, that is the first two chapters, come from “Le avventure di Pinocchio”; 87 sentences are taken from “Cuore”, corresponding to the first three chapters of the novel.</p>
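      <p>The gains quoted earlier in this section for the retrained Stanza model (+19, +19 and +8 points) can be recomputed from the Stanza and Stanza retr. rows of Table 4. The short sketch below does only that arithmetic; the dictionary names and the helper gain_in_points are hypothetical, and the scores are copied from the table.</p>
      <preformat>
```python
# F1 scores copied from Table 4 (default combined model vs. retrained model).
STANZA_DEFAULT = {
    "I Malavoglia": 0.71,
    "Le avventure di Pinocchio": 0.70,
    "Cuore": 0.61,
}
STANZA_RETRAINED = {
    "I Malavoglia": 0.90,
    "Le avventure di Pinocchio": 0.89,
    "Cuore": 0.69,
}


def gain_in_points(before, after):
    """Difference between two F1 scores expressed in points (hundredths).

    Rounding absorbs the small floating-point error of the subtraction.
    """
    return round((after - before) * 100)


gains = {
    novel: gain_in_points(STANZA_DEFAULT[novel], STANZA_RETRAINED[novel])
    for novel in STANZA_DEFAULT
}

for novel, gain in gains.items():
    print(novel, "+" + str(gain))
```
      </preformat>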
    </sec>
    <sec id="sec-5">
      <title>8. Conclusions</title>
      <sec id="sec-5-1">
        <p>This paper presents an assessment of the performance of eight sentence splitting tools adopting different approaches on four 19th-century novels: "I Promessi Sposi" by Alessandro Manzoni, "I Malavoglia" by Giovanni Verga, "Le avventure di Pinocchio" by Carlo Collodi, and "Cuore" by Edmondo de Amicis. Although these texts belong to the same historical period, they show specific features depending on the form and content of the novel as well as the author’s stylistic choices. Among these features is punctuation, which in the late 19th century had not yet reached a detectable stability and was rather experiencing a paradigmatic change.</p>
        <p>Since sentence splitting for Western languages,
including Italian, relies heavily on punctuation disambiguation,
applying existing tools to the four novels considered has
resulted in performances well below the standards. These
texts demonstrate that sentence splitting is not a
completely solved task.</p>
        <p>On the other hand, applying the model retrained on “I Promessi Sposi” to the other three novels showed significant improvements for “Le avventure di Pinocchio” and “I Malavoglia”, and a moderate improvement for “Cuore.” This result suggests that shared historical context and belonging to the same textual genre may offer sufficient similarities to improve the model’s performance. However, the example of "Cuore" is evidence that this is sometimes not enough: some specific features in form, punctuation and style continue to affect sentence splitting, demonstrating that although retraining may mitigate some problems, it does not completely overcome the inherent variability of these texts.</p>
        <p>Philologists have increasingly focused on preserving the original punctuation as a part of the author’s creation of the text, providing valuable and reliable study materials for scholars of linguistics and the history of the Italian language. Their combined knowledge is precious for achieving accurate sentence splitting in these texts. Thus, sentence splitting can be an interesting common ground between different disciplines, potentially leading to the development of tools for the automatic analysis of historical literary texts. This field remains under-explored in the Italian context, offering significant opportunities for further study and cross-disciplinary collaboration.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <sec id="sec-6-1">
        <p>This publication was produced by a researcher with a research contract co-funded by the European Union - PON Ricerca e Innovazione 2014-2020, pursuant to Art. 24, paragraph 3, letter a, of Law No. 240 of 30 December 2010, as amended, and Ministerial Decree No. 1062 of 10 August 2021.</p>
        <p>Laterza, Bari, 2003.</p>
        <p>[12] A. Manzoni, F. Ghisalberti, A. Chiari, L’ultima revisione dei Promessi Sposi, in: Tutte le opere di Alessandro Manzoni. I Promessi Sposi, volume II, Mondadori, Milano, 1954, pp. 789–989.</p>
        <p>[13] M. Straka, UDPipe 2.0 prototype at CoNLL 2018 UD shared task, in: D. Zeman, J. Hajič (Eds.), Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, Association for Computational Linguistics, Brussels, Belgium, 2018, pp. 197–207. URL: https://aclanthology.org/K18-2020. doi:10.18653/v1/K18-2020.</p>
        <p>[14] M.-C. de Marneffe, C. D. Manning, J. Nivre, D. Zeman, Universal Dependencies, Computational Linguistics 47 (2021) 255–308.</p>
        <p>[15] T. Kiss, J. Strunk, Unsupervised multilingual sentence boundary detection, Computational Linguistics 32 (2006) 485–525. URL: https://aclanthology.org/J06-4003. doi:10.1162/coli.2006.32.4.485.</p>
        <p>[16] Y. Liu, A. Stolcke, E. Shriberg, M. Harper, Using conditional random fields for sentence boundary detection in speech, in: Proceedings of the 43rd annual meeting of the Association for Computational Linguistics (ACL’05), 2005, pp. 451–458.</p>
        <p>[17] R. Sheik, T. Gokul, S. Nirmala, Efficient deep learning-based sentence boundary detection in legal text, in: Proceedings of the Natural Legal Language Processing Workshop 2022, 2022, pp. 208–217.</p>
        <p>[18] D. Rudrapal, A. Jamatia, K. Chakma, A. Das, B. Gambäck, Sentence boundary detection for social media text, in: Proceedings of the 12th International Conference on Natural Language Processing, 2015, pp. 254–260.</p>
        <p>[19] A. A. Azzi, H. Bouamor, S. Ferradans, The FinSBD-2019 shared task: Sentence boundary detection in PDF noisy text in the financial domain, in: C.-C. Chen, H.-H. Huang, H. Takamura, H.-H. Chen (Eds.), Proceedings of the First Workshop on Financial Technology and Natural Language Processing, Macao, China, 2019, pp. 74–80. URL: https://aclanthology.org/W19-5512.</p>
        <p>[20] T. Brugger, M. Stürmer, J. Niklaus, MultiLegalSBD: a multilingual legal sentence boundary detection dataset, in: Proceedings of the Nineteenth International Conference on Artificial Intelligence and Law, 2023, pp. 42–51.</p>
        <p>[21] J. Read, R. Dridan, S. Oepen, L. J. Solberg, Sentence boundary detection: A long solved problem?, in: M. Kay, C. Boitet (Eds.), Proceedings of COLING 2012: Posters, The COLING 2012 Organizing Committee, Mumbai, India, 2012, pp. 985–994. URL: https://aclanthology.org/C12-2096.</p>
        <p>[22] A. Ferrari, L. Lala, F. Longo, F. Pecorari, B. Rosi, R. Stojmenova, La punteggiatura italiana contemporanea. Un’analisi comunicativo-testuale, Carocci, Roma, 2018.</p>
        <p>[23] C. D. Manning, M. Surdeanu, J. Bauer, J. R. Finkel, S. Bethard, D. McClosky, The Stanford CoreNLP natural language processing toolkit, in: Proceedings of 52nd annual meeting of the association for computational linguistics: system demonstrations, 2014, pp. 55–60.</p>
        <p>[24] P. Koehn, Europarl: A parallel corpus for statistical machine translation, in: Proceedings of Machine Translation Summit X: Papers, Phuket, Thailand, 2005, pp. 79–86. URL: https://aclanthology.org/2005.mtsummit-papers.11.</p>
        <p>[25] R. Delmonte, A. Bristot, S. Tonelli, VIT-Venice Italian Treebank: Syntactic and quantitative features., in: Sixth International Workshop on Treebanks and Linguistic Theories, volume 1, Northern European Association for Language Technol, 2007, pp. 43–54.</p>
        <p>[26] R. Wicks, M. Post, A unified approach to sentence segmentation of punctuated text in many languages, in: C. Zong, F. Xia, W. Li, R. Navigli (Eds.), Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Association for Computational Linguistics, Online, 2021, pp. 3995–4007. URL: https://aclanthology.org/2021.acl-long.309. doi:10.18653/v1/2021.acl-long.309.</p>
        <p>[27] B. Minixhofer, J. Pfeiffer, I. Vulić, Where’s the point? self-supervised multilingual punctuation-agnostic sentence segmentation, in: A. Rogers, J. Boyd-Graber, N. Okazaki (Eds.), Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Toronto, Canada, 2023, pp. 7215–7235. URL: https://aclanthology.org/2023.acl-long.398. doi:10.18653/v1/2023.acl-long.398.</p>
        <p>[28] A. Palmero Aprosio, G. Moretti, Tint 2.0: an all-inclusive suite for NLP in Italian, in: Proceedings of the Fifth Italian Conference on Computational Linguistics (CLiC-it 2018), Accademia University Press, 2018, pp. 311–317.</p>
        <p>[29] A. Manzoni, B. Colli, I Promessi Sposi. Edizione genetica della Quarantana, Casa del Manzoni, Milano, 2024.</p>
        <p>[30] G. Verga, F. Cecco, I Malavoglia, Fondazione Verga-Interlinea, Catania-Novara, 2014.</p>
        <p>[31] C. Collodi, O. Castellani Pollidori, Le avventure di Pinocchio, Fondazione nazionale Carlo Collodi, Pescia, 1983.</p>
        <p>[32] E. De Amicis, L. Tamburini, Cuore. Libro per ragazzi, Einaudi, Torino, 2018 (1° ed. 1972).</p>
        <p>[33] G. B. Bronzini, Proverbi, discorso e gesto proverbiale nei «Malavoglia», in: I Malavoglia. Atti del Congresso Internazionale di Studi (26-28 novembre 1981), Biblioteca della Fondazione Verga, Catania, 1982, pp. 637–684.</p>
        <p>[34] E. Tonani, Il ’bianco di dialogato’ e il trattamento tipografico del discorso diretto, in: E. Tonani (Ed.), Il romanzo in bianco e nero. Ricerche sull’uso degli spazi bianchi e dell’interpunzione nella narrativa italiana dall’Ottocento a oggi, Franco Cesati, Firenze, 2010, pp. 103–136.</p>
        <p>[35] R. Pellerey, Pinocchio tra dialogo e scrittura, Belfagor 60 (2005) 267–284. URL: https://www.jstor.org/stable/26150287.</p>
        <p>[36] O. Castellani Pollidori, Introduzione, in: C. Collodi, O. Castellani Pollidori (Eds.), Le avventure di Pinocchio, Fondazione nazionale Carlo Collodi, Pescia, 1983, pp. XIII–LXXXIV.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>I.</given-names>
            <surname>Bonomi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Masini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Morgana</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Piotti</surname>
          </string-name>
          , et al.,
          <source>Elementi di linguistica italiana</source>
          , volume
          <volume>103</volume>
          ,
          Carocci
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>D. D.</given-names>
            <surname>Palmer</surname>
          </string-name>
          ,
          <article-title>Chapter 2: Tokenisation and sentence segmentation</article-title>
          ,
          <source>Handbook of natural language processing</source>
          (
          <year>2007</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>R.</given-names>
            <surname>Dridan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Oepen</surname>
          </string-name>
          ,
          <article-title>Document parsing: Towards realistic syntactic analysis</article-title>
          ,
          in:
          <source>Proceedings of The 13th International Conference on Parsing Technologies (IWPT 2013)</source>
          ,
          <year>2013</year>
          , pp.
          <fpage>127</fpage>
          -
          <lpage>133</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>R.</given-names>
            <surname>Wicks</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Post</surname>
          </string-name>
          ,
          <article-title>Does sentence segmentation matter for machine translation?</article-title>
          ,
          <source>in: Proceedings of the Seventh Conference on Machine Translation (WMT)</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>843</fpage>
          -
          <lpage>854</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <article-title>Impact of automatic sentence segmentation on meeting summarization</article-title>
          ,
          <source>in: 2008 IEEE International Conference on Acoustics, Speech and Signal Processing</source>
          , IEEE,
          <year>2008</year>
          , pp.
          <fpage>5009</fpage>
          -
          <lpage>5012</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>P.</given-names>
            <surname>Qi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Bolton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. D.</given-names>
            <surname>Manning</surname>
          </string-name>
          ,
          <article-title>Stanza: A Python natural language processing toolkit for many human languages</article-title>
          ,
          <source>in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations</source>
          ,
          <year>2020</year>
          . URL: https://nlp.stanford.edu/pubs/qi2020stanza.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>C.</given-names>
            <surname>Bosco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Montemagni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Simi</surname>
          </string-name>
          , et al.,
          <article-title>Converting Italian Treebanks: Towards an Italian Stanford Dependency Treebank</article-title>
          , in:
          <source>Proceedings of the 7th Linguistic Annotation Workshop and Interoperability with Discourse</source>
          , The Association for Computational Linguistics
          ,
          <year>2013</year>
          , pp.
          <fpage>61</fpage>
          -
          <lpage>69</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>M.</given-names>
            <surname>Sanguinetti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Bosco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lavelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mazzei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Antonelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Tamburini</surname>
          </string-name>
          ,
          <article-title>PoSTWITA-UD: an Italian Twitter treebank in Universal Dependencies</article-title>
          , in: N. Calzolari, K. Choukri, C. Cieri, T. Declerck, S. Goggi, K. Hasida, H. Isahara, B. Maegaard, J. Mariani, H. Mazo, A. Moreno, J. Odijk, S. Piperidis, T. Tokunaga (Eds.),
          <source>Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)</source>
          , European Language Resources Association (ELRA), Miyazaki, Japan,
          <year>2018</year>
          . URL: https://aclanthology.org/L18-1279.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>E.</given-names>
            <surname>Tonani</surname>
          </string-name>
          ,
          <article-title>Premessa. Tra punteggiatura e tipografia</article-title>
          , in: E. Tonani (Ed.),
          <source>Il romanzo in bianco e nero. Ricerche sull'uso degli spazi bianchi e dell'interpunzione nella narrativa italiana dall'Ottocento a oggi</source>
          , Franco Cesati, Firenze,
          <year>2010</year>
          , pp.
          <fpage>13</fpage>
          -
          <lpage>28</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>A.</given-names>
            <surname>Ferrari</surname>
          </string-name>
          ,
          <article-title>Punteggiatura</article-title>
          , in: G. Antonelli, M. Motolese, L. Tomasi (Eds.),
          <source>Storia dell'italiano scritto. Grammatiche</source>
          , volume IV, Carocci, Roma,
          <year>2018</year>
          , pp.
          <fpage>169</fpage>
          -
          <lpage>202</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>B.</given-names>
            <surname>Mortara Garavelli</surname>
          </string-name>
          , Prontuario di punteggiatura,
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>