=Paper=
{{Paper
|id=Vol-3878/88_main_long
|storemode=property
|title=Is Sentence Splitting a Solved Task? Experiments to the Intersection between NLP and Italian Linguistics
|pdfUrl=https://ceur-ws.org/Vol-3878/88_main_long.pdf
|volume=Vol-3878
|authors=Arianna Redaelli,Rachele Sprugnoli
|dblpUrl=https://dblp.org/rec/conf/clic-it/RedaelliS24
}}
==Is Sentence Splitting a Solved Task? Experiments to the Intersection between NLP and Italian Linguistics==
<pdf width="1500px">https://ceur-ws.org/Vol-3878/88_main_long.pdf</pdf>
<pre>
Is Sentence Splitting a Solved Task? Experiments to the
Intersection Between NLP and Italian Linguistics

Arianna Redaelli¹, Rachele Sprugnoli¹,*
¹ Università di Parma, Via D’Azeglio, 85, 43125 Parma, Italy

Abstract
Sentence splitting, that is the segmentation of the raw input text into sentences, is a fundamental step in text processing.
Although it is considered a solved task for texts such as news articles and Wikipedia pages, the performance of systems
can vary greatly depending on the text genre. This paper presents the evaluation of the performance of eight sentence
splitting tools adopting different approaches (rule-based, supervised, semi-supervised, and unsupervised learning) on Italian
19th-century novels, a genre that has not received sufficient attention so far but which can be an interesting common ground
between Natural Language Processing and Digital Humanities.

Keywords
sentence splitting, text segmentation, literary texts, Italian

CLiC-it 2024: Tenth Italian Conference on Computational Linguistics, Dec 04–06, 2024, Pisa, Italy
CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073
* Corresponding author.
† This paper is the result of the collaboration between the two authors. For the specific concerns of the
  Italian academic attribution system: Rachele Sprugnoli is responsible for Sections 2, 3, 6; Arianna
  Redaelli is responsible for Sections 1, 4, 8. Section 7 was collaboratively written by the two authors.
Email: arianna.redaelli@unipr.it (A. Redaelli); rachele.sprugnoli@unipr.it (R. Sprugnoli)
ORCID: 0000-0001-6374-9033 (A. Redaelli); 0000-0001-6861-5595 (R. Sprugnoli)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License
Attribution 4.0 International (CC BY 4.0).

1. Introduction

Sentence splitting is the process of segmenting a text into sentences¹ by detecting their boundaries,
which, at least for Western languages, including Italian, usually correspond to certain punctuation
marks [2]. This means that sentence splitting, for many languages, is a matter of punctuation
disambiguation, that is, recognizing whether or not a punctuation mark signals a sentence boundary.
The importance of sentence splitting is often underestimated because it is considered an easy task,
but its quality has a strong impact on the quality of subsequent text processing, because errors can
propagate and reduce the performance of downstream tasks such as Syntactic Analysis [3], Machine
Translation [4] and Automatic Summarization [5].

The most popular pipeline models, such as those of Stanza [6] and spaCy², have mostly been trained and
evaluated on fairly formal texts, such as news articles and Wikipedia pages, so the publicly reported
performances tend to be high, i.e. above 0.90 in terms of F1. However, the text genre has a significant
impact on the results. For example, in the CoNLL 2018 shared task “Multilingual Parsing from Raw Text
to Universal Dependencies”, the best system on the Italian ISDT treebank [7] achieved an F1 of 0.99,
while on the PoSTWITA treebank, made of tweets [8], the highest result was 0.66.

¹ By "sentence" we mean a coherent set of words constructed according to the general rules of the
  language, conveying a complete thought that makes sense on its own [1]. A sentence ends with a strong
  punctuation mark (e.g., full stop, question mark, or exclamation point) and is typically followed by
  a capital letter. The definition of sentence adopted here, which like any definition is inherently
  problematic, is motivated by the specific requirements of the present work, as will be seen below.
² https://spacy.io

Given these variations, considering less formal text genres could provide valuable insights into the
challenges of sentence splitting. Among these genres are literary texts, which present unique and
peculiar stylistic and creative features that can break traditional grammatical norms, including
punctuation ones [9]. These features depend on both authorial choices and the cultural context of the
time. As a matter of fact, punctuation can vary significantly depending on the historical period;
literary texts may follow prevailing trends or oppose them, giving rise to new trends. This phenomenon
is particularly evident in the 19th century, when the Italian usus punctandi began shifting from a
primarily syntactic usage, prescribed by grammar books, to a communicative-textual usage of punctuation
marks [10]. Since this shift was probably influenced by the reflections and the practical uses of
prominent authors such as Alessandro Manzoni [11], our study focuses on his historical novel, “I
Promessi Sposi”. The author paid meticulous attention to the punctuation of the text, revising it up to
the final print proofs, and made specific and personal choices in collaboration with the publisher,
alongside more classical ones [12]. Although not always consistent, Manzoni’s decisions make the novel
particularly complex and interesting from a punctuation perspective. Furthermore, “I Promessi Sposi”
has been a fundamental reference for the development of a common written Italian language: starting
from this assumption, many of the author’s punctuation choices have been adopted by later grammars for
rule-making, though only some of them have become part of the standard. Given that punctuation was
still undergoing standardization at the time, and that its use can depend not only on the conventions
of the period but also on the writer’s personal style, the type of content being addressed (and how it
is presented), and even the influence of typography during the printing process, we also decided to
broaden our study to include sections from other novels contemporary to Manzoni’s (1840-42).
Specifically, we analyzed "I Malavoglia" (1881) by Giovanni Verga, "Le avventure di Pinocchio. Storia
di un burattino" (1883) by Carlo Collodi, and "Cuore" (1886) by Edmondo de Amicis.

In this paper, our main contributions are as follows: (i) we provide an estimate of the performance of
eight sentence splitting tools adopting different approaches on a specific and challenging text genre,
namely historical literary fiction texts, which has not received enough attention so far; (ii) we
compare the results considering the point of view of humanities scholars (in particular Italian
linguistics) as the main stakeholders in the considered domain, in order to establish a flourishing
cross-fertilization between NLP and Digital Humanities; (iii) we release manually split data for four
19th-century Italian novels and a shared notebook in which many of the tested systems can be run.³

³ https://github.com/RacheleSprugnoli/Sentence_Splitting_Manzoni

2. Related Work

Sentence splitting systems can be categorized into three macro-classes based on the approach used to
develop them. There are rule-based systems, such as Sentence Splitter⁴ and the Sentencizer module of
spaCy, that use heuristics specific to the various languages and lists of exceptions and abbreviations.
Then, there are supervised systems, which need to be trained on datasets in which sentences are already
correctly segmented. For example, UDPipe [13] and Stanza are trained on Universal Dependencies (UD)
treebanks [14]. Finally, unsupervised systems are trained on datasets of non-segmented texts, taking
advantage of features such as the length of words and collocational information. An example is given by
Punkt, available as a module within the NLTK (Natural Language Toolkit) library [15]. In our work, we
test these various approaches on a benchmark dataset of historical literary fiction texts by evaluating
the performance of eight different systems.

⁴ https://github.com/mediacloud/sentence-splitter

There are several studies that analyze the impact of text genre on sentence splitting, but literary
texts are rarely considered. For example, Liu et al. [16] work on speech transcriptions, Sheik et al.
[17] on legal texts, and Rudrapal et al. [18] on social media posts. Moreover, a shared task on sentence
boundary detection in the financial domain (FinSBD) was organized in 2019, 2020 and 2021 [19].

Most of the available studies concern the processing of English texts, while Italian is usually not
included in the evaluation. An interesting exception is given by a work on multilingual legal texts that
contains a detailed evaluation of the results on Italian documents [20].

Our work draws inspiration from the assessment on English texts provided by Read et al. [21], which
includes, among others, the Sherlock Holmes stories, but moves to the Italian context. Furthermore, we
focus on the literary context, showing how 19th-century novels are a challenge for current sentence
splitting systems.

3. Tools

Sentence splitting is a fundamental analysis in text processing, for which there are many tools
available, also for Italian. For our evaluation we have selected eight tools developed with different
approaches. Some tools are modules integrated in larger pipelines, others are systems specifically
created to perform only sentence splitting. It is important to note that the selected tools do not split
in the presence of a colon or semicolon. Indeed, although recent studies in the punctuation field
identify colons and semicolons as punctuation marks capable of indicating the boundary of a sentence
[22], as anticipated in footnote 1, in this work we have decided not to consider them as separating
marks because of the various forms literary texts can take. To clarify the issue, we can consider the
example of direct speech. In “I Promessi Sposi”, direct speech can be introduced by a verbum dicendi and
a colon, continuing without any interruption. In such cases, splitting at the colon would be relatively
easy. However, direct speech can also be embedded within a sentence that continues after the quotation
closes, creating a non-autonomous text portion that, during sentence splitting, should be manually
reconnected to the one preceding the quotation itself (e.g., Lucia sospirò, e ripeté: «coraggio,» con
una voce che smentiva la parola. EN: Lucia sighed, and repeated, «courage,» in a voice that belied the
word.). An equally troublesome problem arises when the diegetic frame follows the quotation instead of
preceding it. When this happens, the colon is absent, and other punctuation marks like commas are found
before the closing quotation marks or dash (e.g., «È il mio caso,» disse Renzo. EN: «That’s my case,»
said Renzo.). The system would not split the sentences at these punctuation marks, yet the diegetic
frame following the direct speech has the same value and autonomy as the one preceding it. Consequently,
considering colons and semicolons as sentence boundaries would make the segmentation much more complex
and often inaccurate.

The selected tools are the following:

      • CoreNLP⁵: an NLP pipeline written in Java and developed by Stanford University [23]. It contains
        various modules including ssplit, which divides a text into sentences via a set of rules. The
        latest version of the pipeline (4.5.7) supports eight languages including Italian.
      • spaCy: an open-source NLP library which supports dozens of languages, including Italian, and
        provides four alternatives for sentence splitting. Among these, statistical models for Italian
        have been trained to split on colons and semicolons. For this reason, we tested the performance
        only of Sentencizer, the rule-based pipeline component.
      • Sentence Splitter⁶: a Python module based on scripts developed for processing the Europarl
        corpus [24]. It supports several languages with ad-hoc rules.
      • UDPipe⁷: an NLP pipeline based on the UD framework performing tokenization, sentence splitting,
        PoS tagging, lemmatization and syntactic analysis. UDPipe 2 is written in Python and uses the
        tokenizer of UDPipe 1; among the 131 most recent models (version 2.12), seven are for Italian.
        We evaluated the model trained on the VIT treebank [25], which does not (always) split at colons
        and semicolons.
      • Stanza⁸: an NLP package written in Python and based on neural network components. Sentence
        splitting is jointly performed with tokenization by the TokenizeProcessor module. The default
        Italian model is a combination of multiple UD treebanks.
      • Ersatz⁹: a language-agnostic neural model based on a semi-supervised training paradigm. It
        combines the use of regular expressions to detect candidate sentence boundaries with a
        Transformer-based binary classifier [26].
      • Punkt: an unsupervised system which uses collocational information to identify abbreviations,
        initials, and ordinal numbers. All punctuation not included in these elements is considered an
        end-of-sentence marker.
      • WtP¹⁰: an unsupervised multilingual sentence segmentation system based on a self-supervised
        learning approach tested on 85 languages, including Italian. It does not rely on punctuation or
        sentence-segmented training data, thus it is a punctuation-agnostic system [27]. Among the
        various available models, we adopted wtp-canine-s-12l which, according to the official
        documentation of the tool, has the best results on languages other than English.

⁵ https://stanfordnlp.github.io/CoreNLP/
⁶ https://github.com/mediacloud/sentence-splitter
⁷ https://ufal.mff.cuni.cz/udpipe
⁸ https://stanfordnlp.github.io/stanza/
⁹ https://github.com/rewicks/ersatz
¹⁰ https://github.com/segment-any-text/wtpsplit

For the evaluation, the tools were used as they are, with their default configurations and without any
customization. For this reason, given the choices motivated above, we did not consider other systems,
such as Tint [28], which by default split at colons and semicolons.

4. Dataset

The data used to evaluate the aforementioned tools are taken from “I Promessi Sposi” in its final
version published in 1840-1842¹¹. A total of 3,095 sentences, corresponding to 12 chapters of the novel,
were manually split. This dataset was divided into training, development and test sets according to the
proportions 80/10/10, following the UD rules, according to which this proportion is calculated using
syntactic words as units.¹² To obtain syntactic words and calculate this splitting, sentences were
segmented and tokenized by hand; this gold standard was then processed with the combined Stanza
model.¹³ Following this division, the test set is made of 324 sentences.

¹¹ The text, fully digitized and available online, was collated with the reference edition [29] prior
   to analysis, to ensure maximum fidelity to the author’s punctuation choices.
¹² https://universaldependencies.org/release_checklist.html#data-split
¹³ The output of this process was used to train a new Stanza model as reported in Section 6.

Table 1 shows the sentence-ending punctuation marks in the test set. Both the total number of
occurrences (TOTAL) and the number of times a sign is an end-of-sentence marker (EOS) are reported. In
addition to the full stop, sentence boundaries can be indicated by expressive punctuation marks (!, ?)
when followed by a capital letter. If followed by a lowercase letter, instead, these marks only have an
expressive role, modifying the sentence’s internal intonation without determining its end. Low quotation
marks («») and long dashes (–), used for direct speech and thoughts respectively, typically determine a
sentence boundary when they appear with another demarcative punctuation mark (e.g., a full stop). In
Manzoni’s novel, if a closing quotation mark (guillemets or long dashes) appears with another
punctuation mark, the latter is usually placed before the former,
which formally closes the sentence. Lastly, in the novel, suspension points (...) can indicate a
sentence boundary when they suggest a suspensive allusion or when they mark the interruption of a
character’s line due to linguistic or extra-linguistic contingencies. In such cases, the demarcative
function of suspension points is shown either by the following capital letter or by an opening
quotation mark which indicates the beginning of a different character’s line.

Table 1
End-of-sentence markers in the test set.

    MARK    # TOTAL    # EOS
    .       277        237
    »       90         53
    ?       47         22
    !       31         6
    ...     23         3
    –       10         3

5. Results of the Evaluation

Table 2 reports the results of our evaluation in terms of F1. The best performance (0.94) is registered
with Sentence Splitter, a rule-based system. All other tools do not exceed 0.70, thus having
significantly lower performances than those reported on contemporary Italian texts. For example, the
official result of UDPipe 2 on the VIT treebank with the 2.12 model starting from a raw text is 0.95,
that is almost 30 points more than what is obtained on our test set. The lowest result (0.51) is
obtained by the unsupervised WtP system. Although the rule-based approach seems to be the most
promising, only Sentence Splitter has an excellent result even without any adaptation of the existing
rules.

Table 2
Results (in terms of F1) of eight systems developed with different approaches: rule-based (RB),
supervised (S), semi-supervised (SS) and unsupervised learning (U).

    TYPE    SYSTEM                    F1
    RB      spaCy sentencizer         0.61
    RB      CoreNLP 4.5.7 ssplit      0.66
    RB      SentenceSplitter          0.94
    S       UDPipe 2 VIT model        0.66
    S       Stanza combined           0.69
    SS      Ersatz                    0.60
    U       Punkt                     0.68
    U       WtP wtp-canine-s-12l      0.51

Analyzing the outputs of the various systems, it is possible to notice some recurring errors (a few
examples are reported in Table 3):

    1. Misinterpretation of guillemets («, »). The closing sign of the low quotation marks is not
       recognized as a sentence boundary, so in the automatic segmentation it can appear at the
       beginning or in the middle of a sentence.
    2. In supervised systems semicolons and colons are sometimes considered as sentence boundary
       signals. Indeed, in the VIT treebank and in those used to train the combined Stanza model,
       sentences are segmented inconsistently: sometimes semicolons and colons are strong punctuation,
       and sometimes not.
    3. Suspension points are always considered strong punctuation marks and the sentence is split
       after them.
    4. A sentence is often split after an expressive punctuation mark (?, !) even if it is followed by
       a lowercase letter.
    5. The long dash is not recognized as a sentence-ending marker; consequently, either the sentence
       continues after the dash or the dash appears at the beginning of the following sentence.

6. Training a New Stanza Model

With the rest of the manually split data, namely 2,447 sentences for the training set and 324 for the
development set, a new Stanza model specific for Manzoni’s text was trained. Different amounts of
sentences were used for training in order to control the effect of the dataset size on the performance.
The results obtained with 1500 steps are the following:

      • 300 sentences: 0.97 F1
      • 1000 sentences: 0.98 F1
      • 2,447 sentences: 0.99 F1

With just 300 sentences there is already a clear improvement over the default model, obtaining an even
higher result than the one obtained with Sentence Splitter, the system that had proven to be the best
on our test set.

7. What About Other Novels?

Table 4 displays the performance of the same systems tested on “I Promessi Sposi” when applied to
approximately the first 90 sentences of three other important 19th-century novels:¹⁴ “I Malavoglia”
(1881) by Giovanni Verga [30], “Le avventure di Pinocchio. Storia di un burattino” (1883) by Carlo
Collodi [31], and “Cuore” (1886) by Edmondo de Amicis [32].¹⁵

¹⁴ The reference edition text was used for the analysis of these novels too.
¹⁵ 86 sentences are taken from “I Malavoglia”, corresponding to the first chapter of the novel; 93
   sentences, that is the first two chapters, come from “Le avventure di Pinocchio”; 87 sentences are
   taken from “Cuore”, corresponding to the first three chapters of the novel.
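
The F1 scores used throughout this evaluation compare the sentence boundaries predicted by a tool with
those of the manually split gold standard. The paper does not describe the scoring script, so the
following Python sketch is only one plausible way to compute such a score: the boundary representation
(sentence end offsets counted in non-whitespace characters) and the function names boundary_offsets and
boundary_f1 are assumptions made for illustration. The sample sentences are the second example of
Table 3.

```python
from typing import List, Set, Tuple


def boundary_offsets(sentences: List[str]) -> Set[int]:
    """End offset of each sentence, counted in non-whitespace characters.

    Ignoring whitespace makes gold and predicted segmentations comparable
    even when tools normalise spaces differently (an assumption of this sketch).
    """
    offsets: Set[int] = set()
    position = 0
    for sentence in sentences:
        position += sum(1 for ch in sentence if not ch.isspace())
        offsets.add(position)
    return offsets


def boundary_f1(gold: List[str], predicted: List[str]) -> Tuple[float, float, float]:
    """Precision, recall and F1 over sentence-end boundaries."""
    gold_ends = boundary_offsets(gold)
    pred_ends = boundary_offsets(predicted)
    true_positives = len(gold_ends & pred_ends)
    precision = true_positives / len(pred_ends) if pred_ends else 0.0
    recall = true_positives / len(gold_ends) if gold_ends else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Second example of Table 3: gold segmentation vs. two system outputs.
gold = ["– È lei, di certo!–", "Era proprio lei, con la buona vedova."]
udpipe_output = ["– È lei, di certo!– Era proprio lei, con la buona vedova."]
ersatz_output = ["– È lei, di certo!", "– Era proprio lei, con la buona vedova."]

print(boundary_f1(gold, udpipe_output))  # one merged sentence: full precision, half recall
print(boundary_f1(gold, ersatz_output))  # dash attached to the wrong sentence
```

Note that, in this sketch, the final boundary of a text always matches when gold and prediction cover
the same characters, so scores on very short samples are inflated; on a few hundred sentences the effect
is negligible.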
Table 3
Examples of errors in two of the tested systems compared with the manually split sentences.

    Example 1
      TEST GOLD:              1) «Al sagrestano gli crede?»
                              2) «Perché?»
      UDPipe 2 - VIT model:   1) » «Al sagrestano gli crede?» «Perché?»
      Ersatz:                 1) » «Al sagrestano gli crede?
                              2) » «Perché?

    Example 2
      TEST GOLD:              1) – È lei, di certo!–
                              2) Era proprio lei, con la buona vedova.
      UDPipe 2 - VIT model:   1) – È lei, di certo!– Era proprio lei, con la buona vedova.
      Ersatz:                 1) – È lei, di certo!
                              2) – Era proprio lei, con la buona vedova.

    Example 3
      TEST GOLD:              1) Anche Agnese, veda; anche Agnese. . . »
                              2) «Uh! ha voglia di scherzare, lei,» disse questa.
      UDPipe 2 - VIT model:   1) Anche Agnese, veda; anche Agnese. . . » «Uh! ha voglia di scherzare, lei,» disse questa.
      Ersatz:                 1) Anche Agnese, veda; anche Agnese. . . » «Uh!
                              2) ha voglia di scherzare, lei,» disse questa. «

Table 4
Results on about 90 sentences taken from other 19th-century novels. Stanza retr. refers to the model
retrained on Manzoni’s novel, as described in Section 6.

                      Malavoglia    Pinocchio    Cuore
    spaCy             0.73          0.35         0.84
    CoreNLP ssplit    0.76          0.72         0.62
    SentenceSplit.    0.77          0.45         0.68
    UDPipe            0.75          0.79         0.67
    Stanza            0.71          0.70         0.61
    Stanza retr.      0.90          0.89         0.69
    Ersatz            0.72          0.75         0.66
    Punkt             0.73          0.77         0.66
    WtP               0.53          0.78         0.39

The results obtained are once again lower than those reported for contemporary texts, but the model
retrained on “I Promessi Sposi” shows improved performance for all novels, especially when applied to
“I Malavoglia” and to “Le avventure di Pinocchio” (+19 points with respect to the default Stanza
combined model in both cases); the improvement is more limited for “Cuore” (+8 points).

The rule-based approach is promising but with different systems (spaCy for “Cuore” and ssplit for “I
Malavoglia”). Instead, the VIT model of UDPipe, and therefore a supervised approach, is the best on “Le
avventure di Pinocchio”. Some tools obtain extremely different results depending on the text they
process. spaCy and Sentence Splitter record a very low result on “Le avventure di Pinocchio” (0.35 and
0.45 respectively) while WtP has an F1 of only 0.39 on “Cuore”, half of what it achieved on “Le
avventure di Pinocchio”.

This diversified situation is principally due to the fact that each novel presents unique
characteristics, even in punctuation.

“I Malavoglia” is a choral novel in which the various styles of speech of the characters and the
narrative voice are mixed together. Punctuation marks largely represent [...] whether introduced by
colons or not, and sometimes isolate a complete enunciative section. The long dash (–), instead, has a
number of different functions [34]: one of these is to signal direct speech, but often marking only its
beginning and not its end. This leads, on one hand, to a variety of ways of handling parenthetical
elements and, on the other hand, to a blurred boundary between the characters’ speech, the characters’
speech mediated by the narrator, and the narrator’s own discourse.

“Pinocchio”, a novel written for a young audience, is characterized by a strongly dialogic style [35].
For direct speech, including the simulated dialogue between the narrator and the reader, the long dash
(–) is abundantly used, but as in "I Malavoglia", the opening dashes are not always accompanied by the
closing ones. Additionally, Collodi frequently uses punctuation clusters, specifically the exclamation
mark followed by suspension points (!...), at the end of sentences [36], a possibility mostly not
contemplated by late 19th-century grammars.

Lastly, Edmondo de Amicis’s novel “Cuore” tells the story of a child’s school experience from his point
of view, adopting a diary-like structure. In “Cuore”, the linguistic form is simple and plain: the
sentences are mainly short and often end with a standard strong punctuation mark, followed by a capital
letter. Direct speech is clearly indicated by long dashes (–), but successive lines of dialogue are
arranged consecutively on the page, and in such cases, the closing dash of the previous line also serves
as the opening dash of the next line. Since the lines of dialogue are perfectly integrated into the
narrative structure, they can end with various punctuation marks, from commas to semicolons to full
stops. When the punctuation mark is not strong, after the preliminary conclusion of the line, the text
continues with the narrator’s discourse.

Beyond the specific differences listed schematically above, there are also some common typographical
and punctuation features among the considered novels. For example, when a closing quotation mark
appears with another punctuation mark, the latter in general occurs
this mixture. Indeed, among the main peculiarities of before the former, as found in “I Promessi Sposi”.
the novel is the original and personal use of quotation
marks. For example, guillemets («,») are frequently used
to refer to popular sayings and proverbs as well as to short
formulas [33], which sometimes intersperse the diegesis,
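
   The gains quoted for the retrained Stanza model, and the remark that WtP on “Cuore” reaches
half of its score on “Le avventure di Pinocchio”, can be checked directly against the F1 table
above. The following is a minimal Python sketch of that arithmetic; the dictionary layout and
the function name are illustrative assumptions, not part of any evaluated tool, and the values
are simply copied from the table.

```python
# F1 scores copied from the results table (only the rows needed here).
# The structure and names below are illustrative, not from any tool's API.
F1 = {
    "Stanza":       {"Malavoglia": 0.71, "Pinocchio": 0.70, "Cuore": 0.61},
    "Stanza retr.": {"Malavoglia": 0.90, "Pinocchio": 0.89, "Cuore": 0.69},
    "WtP":          {"Malavoglia": 0.53, "Pinocchio": 0.78, "Cuore": 0.39},
}


def gain_in_points(baseline, retrained):
    """Return the per-novel F1 difference, expressed in percentage points."""
    return {
        novel: round((retrained[novel] - baseline[novel]) * 100)
        for novel in baseline
    }


# Gain of the model retrained on "I Promessi Sposi" over default Stanza.
gains = gain_in_points(F1["Stanza"], F1["Stanza retr."])

# Ratio between WtP's score on "Cuore" and on "Le avventure di Pinocchio".
wtp_ratio = F1["WtP"]["Cuore"] / F1["WtP"]["Pinocchio"]

print(gains)
print(round(wtp_ratio, 2))
```

   Rounding to whole points avoids floating-point noise in the subtraction of two-decimal
scores.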
8. Conclusions

This paper presents an assessment of the performance of eight sentence splitting tools adopting
different approaches on four 19th-century novels: “I Promessi Sposi” by Alessandro Manzoni,
“I Malavoglia” by Giovanni Verga, “Le avventure di Pinocchio” by Carlo Collodi, and “Cuore” by
Edmondo de Amicis. Although these texts belong to the same historical period, they show
specific features depending on the form and content of the novel as well as the author’s
stylistic choices. Among these features is punctuation, which in the late 19th century had not
yet reached a detectable stability and was rather experiencing a paradigmatic change.
   Since sentence splitting for Western languages, including Italian, relies heavily on
punctuation disambiguation, applying existing tools to the four novels considered has resulted
in performance well below the levels reported for contemporary texts. These texts demonstrate
that sentence splitting is not a completely solved task.
   On the other hand, applying the model retrained on “I Promessi Sposi” to the other three
novels showed significant improvements for “Le avventure di Pinocchio” and “I Malavoglia”, and
a moderate improvement for “Cuore”. This result suggests that shared historical context and
belonging to the same textual genre may offer sufficient similarities to improve the model’s
performance. However, the example of “Cuore” is evidence of how this is sometimes not enough:
some specific features in form, punctuation and style continue to affect sentence splitting,
demonstrating that although retraining may mitigate some problems, it does not completely
overcome the inherent variability of these texts.
   Philologists have increasingly focused on preserving the original punctuation as a part of
the author’s creation of the text, providing valuable and reliable resources for scholars of
linguistics and the history of the Italian language. Their combined knowledge is precious for
achieving accurate sentence splitting in these texts. Thus, sentence splitting can be an
interesting common ground between different disciplines, potentially leading to the development
of tools for the automatic analysis of historical literary texts. This field remains
under-explored in the Italian context, offering significant opportunities for further study and
cross-disciplinary collaboration.

Acknowledgments

This publication was produced by a researcher holding a research contract co-funded by the
European Union - PON Ricerca e Innovazione 2014-2020, pursuant to Art. 24, paragraph 3,
letter a, of Law No. 240 of 30 December 2010, as subsequently amended, and Ministerial Decree
No. 1062 of 10 August 2021.

References

[1] I. Bonomi, A. Masini, S. Morgana, M. Piotti, et al., Elementi di linguistica italiana,
     volume 103, Carocci, 2010.
[2] D. D. Palmer, Chapter 2: Tokenisation and sentence segmentation, Handbook of natural
     language processing (2007).
[3] R. Dridan, S. Oepen, Document parsing: Towards realistic syntactic analysis, in:
     Proceedings of The 13th International Conference on Parsing Technologies (IWPT 2013),
     2013, pp. 127–133.
[4] R. Wicks, M. Post, Does sentence segmentation matter for machine translation?, in:
     Proceedings of the Seventh Conference on Machine Translation (WMT), 2022, pp. 843–854.
[5] Y. Liu, S. Xie, Impact of automatic sentence segmentation on meeting summarization, in:
     2008 IEEE International Conference on Acoustics, Speech and Signal Processing, IEEE,
     2008, pp. 5009–5012.
[6] P. Qi, Y. Zhang, Y. Zhang, J. Bolton, C. D. Manning, Stanza: A Python natural language
     processing toolkit for many human languages, in: Proceedings of the 58th Annual Meeting
     of the Association for Computational Linguistics: System Demonstrations, 2020. URL:
     https://nlp.stanford.edu/pubs/qi2020stanza.pdf.
[7] C. Bosco, S. Montemagni, M. Simi, et al., Converting Italian Treebanks: Towards an Italian
     Stanford Dependency Treebank, in: Proceedings of the 7th Linguistic Annotation Workshop
     and Interoperability with Discourse, The Association for Computational Linguistics, 2013,
     pp. 61–69.
[8] M. Sanguinetti, C. Bosco, A. Lavelli, A. Mazzei, O. Antonelli, F. Tamburini, PoSTWITA-UD:
     an Italian Twitter treebank in Universal Dependencies, in: N. Calzolari, K. Choukri,
     C. Cieri, T. Declerck, S. Goggi, K. Hasida, H. Isahara, B. Maegaard, J. Mariani, H. Mazo,
     A. Moreno, J. Odijk, S. Piperidis, T. Tokunaga (Eds.), Proceedings of the Eleventh
     International Conference on Language Resources and Evaluation (LREC 2018), European
     Language Resources Association (ELRA), Miyazaki, Japan, 2018. URL:
     https://aclanthology.org/L18-1279.
[9] E. Tonani, Premessa. Tra punteggiatura e tipografia, in: E. Tonani (Ed.), Il romanzo in
     bianco e nero. Ricerche sull’uso degli spazi bianchi e dell’interpunzione nella narrativa
     italiana dall’Ottocento a oggi, Franco Cesati, Firenze, 2010, pp. 13–28.
[10] A. Ferrari, Punteggiatura, in: G. Antonelli, M. Motolese, L. Tomasi (Eds.), Storia
     dell’italiano scritto. Grammatiche, volume IV, Carocci, Roma, 2018, pp. 169–202.
[11] B. Mortara Garavelli, Prontuario di punteggiatura,
     Laterza, Bari, 2003.
[12] A. Manzoni, F. Ghisalberti, A. Chiari, L’ultima revisione dei Promessi Sposi, in: Tutte
     le opere di Alessandro Manzoni. I Promessi Sposi, volume II, Mondadori, Milano, 1954,
     pp. 789–989.
[13] M. Straka, UDPipe 2.0 prototype at CoNLL 2018 UD shared task, in: D. Zeman, J. Hajič
     (Eds.), Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to
     Universal Dependencies, Association for Computational Linguistics, Brussels, Belgium,
     2018, pp. 197–207. URL: https://aclanthology.org/K18-2020. doi:10.18653/v1/K18-2020.
[14] M.-C. De Marneffe, C. D. Manning, J. Nivre, D. Zeman, Universal Dependencies,
     Computational linguistics 47 (2021) 255–308.
[15] T. Kiss, J. Strunk, Unsupervised multilingual sentence boundary detection, Computational
     Linguistics 32 (2006) 485–525. URL: https://aclanthology.org/J06-4003.
     doi:10.1162/coli.2006.32.4.485.
[16] Y. Liu, A. Stolcke, E. Shriberg, M. Harper, Using conditional random fields for sentence
     boundary detection in speech, in: Proceedings of the 43rd annual meeting of the
     Association for Computational Linguistics (ACL’05), 2005, pp. 451–458.
[17] R. Sheik, T. Gokul, S. Nirmala, Efficient deep learning-based sentence boundary detection
     in legal text, in: Proceedings of the Natural Legal Language Processing Workshop 2022,
     2022, pp. 208–217.
[18] D. Rudrapal, A. Jamatia, K. Chakma, A. Das, B. Gambäck, Sentence boundary detection for
     social media text, in: Proceedings of the 12th International Conference on Natural
     Language Processing, 2015, pp. 254–260.
[19] A. A. Azzi, H. Bouamor, S. Ferradans, The FinSBD-2019 shared task: Sentence boundary
     detection in PDF noisy text in the financial domain, in: C.-C. Chen, H.-H. Huang,
     H. Takamura, H.-H. Chen (Eds.), Proceedings of the First Workshop on Financial Technology
     and Natural Language Processing, Macao, China, 2019, pp. 74–80. URL:
     https://aclanthology.org/W19-5512.
[20] T. Brugger, M. Stürmer, J. Niklaus, MultiLegalSBD: a multilingual legal sentence boundary
     detection dataset, in: Proceedings of the Nineteenth International Conference on
     Artificial Intelligence and Law, 2023, pp. 42–51.
[21] J. Read, R. Dridan, S. Oepen, L. J. Solberg, Sentence boundary detection: A long solved
     problem?, in: M. Kay, C. Boitet (Eds.), Proceedings of COLING 2012: Posters, The COLING
     2012 Organizing Committee, Mumbai, India, 2012, pp. 985–994. URL:
     https://aclanthology.org/C12-2096.
[22] A. Ferrari, L. Lala, F. Longo, F. Pecorari, B. Rosi, R. Stojmenova, La punteggiatura
     italiana contemporanea. Un’analisi comunicativo-testuale, Carocci, Roma, 2018.
[23] C. D. Manning, M. Surdeanu, J. Bauer, J. R. Finkel, S. Bethard, D. McClosky, The Stanford
     CoreNLP natural language processing toolkit, in: Proceedings of 52nd annual meeting of
     the association for computational linguistics: system demonstrations, 2014, pp. 55–60.
[24] P. Koehn, Europarl: A parallel corpus for statistical machine translation, in:
     Proceedings of Machine Translation Summit X: Papers, Phuket, Thailand, 2005, pp. 79–86.
     URL: https://aclanthology.org/2005.mtsummit-papers.11.
[25] R. Delmonte, A. Bristot, S. Tonelli, VIT-Venice Italian Treebank: Syntactic and
     quantitative features, in: Sixth International Workshop on Treebanks and Linguistic
     Theories, volume 1, Northern European Association for Language Technol, 2007, pp. 43–54.
[26] R. Wicks, M. Post, A unified approach to sentence segmentation of punctuated text in many
     languages, in: C. Zong, F. Xia, W. Li, R. Navigli (Eds.), Proceedings of the 59th Annual
     Meeting of the Association for Computational Linguistics and the 11th International Joint
     Conference on Natural Language Processing (Volume 1: Long Papers), Association for
     Computational Linguistics, Online, 2021, pp. 3995–4007. URL:
     https://aclanthology.org/2021.acl-long.309. doi:10.18653/v1/2021.acl-long.309.
[27] B. Minixhofer, J. Pfeiffer, I. Vulić, Where’s the point? self-supervised multilingual
     punctuation-agnostic sentence segmentation, in: A. Rogers, J. Boyd-Graber, N. Okazaki
     (Eds.), Proceedings of the 61st Annual Meeting of the Association for Computational
     Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Toronto,
     Canada, 2023, pp. 7215–7235. URL: https://aclanthology.org/2023.acl-long.398.
     doi:10.18653/v1/2023.acl-long.398.
[28] A. Palmero Aprosio, G. Moretti, Tint 2.0: an all-inclusive suite for NLP in Italian, in:
     Proceedings of the Fifth Italian Conference on Computational Linguistics (CLiC-it 2018),
     Accademia University Press, 2018, pp. 311–317.
[29] A. Manzoni, B. Colli, I Promessi Sposi. Edizione genetica della Quarantana, Casa del
     Manzoni, Milano, 2024.
[30] G. Verga, F. Cecco, I Malavoglia, Fondazione Verga-Interlinea, Catania-Novara, 2014.
[31] C. Collodi, O. Castellani Pollidori, Le avventure di Pinocchio, Fondazione nazionale
     Carlo Collodi, Pescia, 1983.
[32] E. De Amicis, L. Tamburini, Cuore. Libro per ragazzi, Einaudi, Torino, 2018 (1° ed. 1972).
[33] G. B. Bronzini, Proverbi, discorso e gesto proverbiale nei «Malavoglia», in: I Malavoglia.
     Atti del Congresso Internazionale di Studi (26-28 novembre 1981), Biblioteca della
     Fondazione Verga, Catania, 1982, pp. 637–684.
[34] E. Tonani, Il ’bianco di dialogato’ e il trattamento tipografico del discorso diretto,
     in: E. Tonani (Ed.), Il romanzo in bianco e nero. Ricerche sull’uso degli spazi bianchi e
     dell’interpunzione nella narrativa italiana dall’Ottocento a oggi, Franco Cesati,
     Firenze, 2010, pp. 103–136.
[35] R. Pellerey, Pinocchio tra dialogo e scrittura, Belfagor 60 (2005) 267–284. URL:
     https://www.jstor.org/stable/26150287.
[36] O. Castellani Pollidori, Introduzione, in: C. Collodi, O. Castellani Pollidori (Eds.), Le
     avventure di Pinocchio, Fondazione nazionale Carlo Collodi, Pescia, 1983,
     pp. XIII–LXXXIV.

</pre>