
Is Sentence Splitting a Solved Task? Experiments to the Intersection Between NLP and Italian Linguistics

Arianna Redaelli

Rachele Sprugnoli

Università di Parma, Via D'Azeglio, 85, 43125 Parma, Italy

Sentence splitting, that is the segmentation of the raw input text into sentences, is a fundamental step in text processing. Although it is considered a solved task for texts such as news articles and Wikipedia pages, the performance of systems can vary greatly depending on the text genre. This paper presents the evaluation of the performance of eight sentence splitting tools adopting different approaches (rule-based, supervised, semi-supervised, and unsupervised learning) on Italian 19th-century novels, a genre that has not received sufficient attention so far but which can be an interesting common ground between Natural Language Processing and Digital Humanities.

Keywords: sentence splitting, text segmentation, literary texts, Italian

CLiC-it 2024: Tenth Italian Conference on Computational Linguistics, Dec 04-06, 2024, Pisa, Italy
* Corresponding author.
† This paper is the result of the collaboration between the two authors. For the specific concerns of the Italian academic attribution system: Rachele Sprugnoli is responsible for Sections 2, 3, 6; Arianna Redaelli is responsible for Sections 1, 4, 8. Section 7 was collaboratively written by the two authors.
Email: arianna.redaelli@unipr.it (A. Redaelli); rachele.sprugnoli@unipr.it (R. Sprugnoli)
ORCID: 0000-0001-6374-9033 (A. Redaelli); 0000-0001-6861-5595 (R. Sprugnoli)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1. Introduction

Sentence splitting is the process of segmenting a text into sentences¹ by detecting their boundaries, which, at least for Western languages, including Italian, usually correspond to certain punctuation marks [2]. This means that sentence splitting, for many languages, is a matter of punctuation disambiguation, that is, recognizing whether or not a punctuation mark signals a sentence boundary. The importance of sentence splitting is often underestimated because it is considered an easy task, but its quality has a strong impact on the quality of subsequent text processing: errors can propagate, reducing the performance of downstream tasks such as Syntactic Analysis [3], Machine Translation [4] and Automatic Summarization [5].

The most popular pipeline models, such as those of Stanza [6] and spaCy², have mostly been trained and evaluated on fairly formal texts, such as news articles and Wikipedia pages, so the publicly reported performances tend to be high, i.e. above 0.90 in terms of F1. However, the text genre has a significant impact on the results. For example, in the CoNLL 2018 shared task “Multilingual Parsing from Raw Text to Universal Dependencies”, the best system on the Italian ISDT treebank [7] achieved an F1 of 0.99, while on the PoSTWITA treebank, made of tweets [8], the highest result was 0.66.

Given these variations, considering less formal text genres could provide valuable insights into the challenges of sentence splitting. Among these genres are literary texts, which present unique and peculiar stylistic and creative features that can break traditional grammatical norms, including punctuation ones [9]. These features depend on both authorial choices and the cultural context of the time.

As a matter of fact, punctuation can vary significantly depending on the historical period; literary texts may follow prevailing trends or oppose them, giving rise to new trends. This phenomenon is particularly evident in the 19th century, when the Italian usus punctandi began shifting from a primarily syntactic usage, prescribed by grammar books, to a communicative-textual usage of punctuation marks [10]. Since this shift was probably influenced by the reflections and the practical uses of prominent authors such as Alessandro Manzoni [11], our study focuses on his historical novel, “I Promessi Sposi”. The author paid meticulous attention to the punctuation of the text, revising it up to the final print proofs, and made specific and personal choices in collaboration with the publisher, alongside more classical ones [12]. Although not always consistent, Manzoni’s decisions make the novel particularly complex and interesting from a punctuation perspective. Furthermore, “I Promessi Sposi” has been a fundamental reference for the development of a common written Italian language: starting from this assumption, many of the author’s punctuation choices have been adopted by later grammars for rule-making, though only some of them have become part of the standard.

Given that punctuation was still undergoing standardization at the time, and that its use can depend not only on the conventions of the period but also on the writer’s personal style, the type of content being addressed (and how it is presented), and even the influence of typography during the printing process, we also decided to broaden our study to include sections from other novels contemporary to Manzoni’s (1840-42). Specifically, we analyzed “I Malavoglia” (1881) by Giovanni Verga, “Le avventure di Pinocchio. Storia di un burattino” (1883) by Carlo Collodi, and “Cuore” (1886) by Edmondo de Amicis.

In this paper, our main contributions are as follows: (i) we provide an estimate of the performance of eight sentence splitting tools adopting different approaches on a specific and challenging text genre, namely historical literary fiction texts, which has not received enough attention so far; (ii) we compare the results considering the point of view of humanities scholars (in particular Italian linguistics) as the main stakeholders in the considered domain, in order to establish a flourishing cross-fertilization between NLP and Digital Humanities; (iii) we release manually split data for four 19th-century Italian novels and a shared notebook in which many of the tested systems can be run.³

¹ By “sentence” we mean a coherent set of words constructed according to the general rules of the language, conveying a complete thought that makes sense on its own [1]. A sentence ends with a strong punctuation mark (e.g., full stop, question mark, or exclamation point) and is typically followed by a capital letter. The definition of sentence adopted here, which like any definition is inherently problematic, is motivated by the specific requirements of the present work, as will be seen below.
² https://spacy.io
³ https://github.com/RacheleSprugnoli/Sentence_Splitting_Manzoni

2. Related Work

Sentence splitting systems can be categorized into three macro-classes based on the approach used to develop them. There are rule-based systems, such as Sentence Splitter⁴ and the Sentencizer module of spaCy, that use heuristics specific to the various languages and lists of exceptions and abbreviations. Then, there are supervised systems, which need to be trained on datasets in which sentences are already correctly segmented. For example, UDPipe [13] and Stanza are trained on Universal Dependencies (UD) treebanks [14]. Finally, unsupervised systems are trained on datasets of non-segmented texts, taking advantage of features such as the length of words and collocational information. An example is given by Punkt, available as a module within the NLTK (Natural Language Toolkit) library [15]. In our work, we test these various approaches on a benchmark dataset of historical literary fiction texts by evaluating the performance of eight different systems.

There are several studies that analyze the impact of text genre on sentence splitting, but literary texts are rarely considered. For example, Liu et al. [16] work on speech transcriptions, Sheik et al. [17] on legal texts, and Rudrapal et al. [18] on social media posts. Moreover, a shared task on sentence boundary detection in the financial domain (FinSBD) was organized in 2019, 2020 and 2021 [19].

Most of the available studies concern the processing of English texts, while Italian is usually not included in the evaluation. An interesting exception is given by a work on multilingual legal texts that contains a detailed evaluation of the results on Italian documents [20].

Our work draws inspiration from the assessment on English texts provided by Read et al. [21], which includes, among others, the Sherlock Holmes stories, but moves to the Italian context. Furthermore, we focus on the literary context, showing how 19th-century novels are a challenge for current sentence splitting systems.

⁴ https://github.com/mediacloud/sentence-splitter

3. Tools

Sentence splitting is a fundamental analysis in text processing, for which there are many tools available, also for Italian. For our evaluation we have selected eight tools developed with different approaches. Some tools are modules integrated in larger pipelines, others are systems specifically created to perform only sentence splitting. It is important to note that the selected tools do not split at colons or semicolons. Indeed, although recent studies in the punctuation field identify colons and semicolons as punctuation marks capable of indicating the boundary of a sentence [22], as anticipated in footnote 1, in this work we have decided not to consider them as separating marks because of the various forms literary texts can take.

To clarify the issue, we can consider the example of direct speech. In “I Promessi Sposi”, direct speech can be introduced by a verbum dicendi and a colon, continuing without any interruption. In such cases, splitting at the colon would be relatively easy. However, direct speech can also be embedded within a sentence that continues after the quotation closes, creating a non-autonomous text portion that, during sentence splitting, should be manually reconnected to the one preceding the quotation itself (e.g., Lucia sospirò, e ripeté: «coraggio,» con una voce che smentiva la parola. EN: Lucia sighed, and repeated, «courage,» in a voice that belied the word.). An equally troublesome problem arises when the diegetic frame follows the quotation instead of preceding it. When this happens, the colon is absent, and other punctuation marks like commas are found before the closing quotation mark or dash (e.g., «È il mio caso,» disse Renzo. EN: «That’s my case,» said Renzo.). The system would not split the sentences at these punctuation marks, yet the diegetic frame following the direct speech has the same value and autonomy as the one preceding it. Consequently, considering colons and semicolons as sentence boundaries would make the segmentation much more complex and often inaccurate.
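To make the point about colons concrete, the following is a minimal sketch, not one of the evaluated tools: a naive splitter (the function name `split_after` and its behaviour are assumptions introduced only for illustration) that cuts the text after a configurable set of punctuation marks. Applied to the example quoted above, it shows how adding the colon to the boundary marks detaches the quotation and its trailing diegetic frame from the verbum dicendi.

```python
import re

def split_after(text: str, marks: str) -> list[str]:
    """Split `text` after any character in `marks` that is followed by whitespace.

    Hypothetical helper for illustration only; real tools use richer rules.
    """
    pattern = r"(?<=[" + re.escape(marks) + r"])\s+"
    return [piece for piece in re.split(pattern, text) if piece]

example = "Lucia sospirò, e ripeté: «coraggio,» con una voce che smentiva la parola."

# Strong marks only: the sentence is kept whole.
strong_only = split_after(example, ".!?")

# Colon and semicolon added: the quotation and the frame that follows it
# become a separate, non-autonomous fragment.
with_colon = split_after(example, ".!?:;")

for fragment in with_colon:
    print(repr(fragment))
```

With colons treated as boundaries, the second fragment would have to be reattached by hand to the first, which is the manual reconnection step mentioned above.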

Selected tools are the following:

• CoreNLP⁵: an NLP pipeline written in Java and developed by Stanford University [23]. It contains various modules including ssplit, which divides a text into sentences via a set of rules. The latest version of the pipeline (4.5.7) supports eight languages including Italian.
• spaCy: an open-source NLP library which supports dozens of languages, including Italian, and provides four alternatives for sentence splitting. Among these, statistical models for Italian have been trained to split on colons and semicolons. For this reason, we tested the performance only of Sentencizer, the rule-based pipeline component.
• Sentence Splitter⁶: a Python module based on scripts developed for processing the Europarl corpus [24]. It supports several languages with ad-hoc rules.
• UDPipe⁷: an NLP pipeline based on the UD framework performing tokenization, sentence splitting, PoS tagging, lemmatization and syntactic analysis. UDPipe 2 is written in Python and uses the tokenizer of UDPipe 1; among the 131 most recent models (version 2.12), seven are for Italian. We evaluated the model trained on the VIT treebank [25] that does not (always) split at colons and semicolons.
• Stanza⁸: an NLP package written in Python and based on neural network components. Sentence splitting is jointly performed with tokenization by the TokenizeProcessor module. The default Italian model is a combination of multiple UD treebanks.
• Ersatz⁹: a language-agnostic neural model based on a semi-supervised training paradigm. It combines the use of regular expressions to detect candidate sentence boundaries with a Transformer-based binary classifier [26].
• Punkt: an unsupervised system which uses collocational information to identify abbreviations, initials, and ordinal numbers. All punctuation not included in these elements is considered an end-of-sentence marker.
• WtP¹⁰: an unsupervised multilingual sentence segmentation system based on a self-supervised learning approach tested on 85 languages, including Italian. It does not rely on punctuation or sentence-segmented training data, thus it is a punctuation-agnostic system [27]. Among the various available models, we adopted wtp-canine-s-12l which, according to the official documentation of the tool, has the best results on languages other than English.

For the evaluation, the tools were used as they are, using their default configurations, without any customization. For this reason, given the choices motivated above, we did not consider other systems, such as Tint [28], which by default split at colons and semicolons.

⁵ https://stanfordnlp.github.io/CoreNLP/
⁶ https://github.com/mediacloud/sentence-splitter
⁷ https://ufal.mf.cuni.cz/udpipe
⁸ https://stanfordnlp.github.io/stanza/
⁹ https://github.com/rewicks/ersatz
¹⁰ https://github.com/segment-any-text/wtpsplit

4. Dataset

The data used to evaluate the aforementioned tools are taken from “I Promessi Sposi” in its final version published in 1840-1842.¹¹ 3,095 sentences, corresponding to 12 chapters of the novel, were manually split. This dataset was divided into training, development and test sets according to the proportions 80/10/10, following the UD rules, for which this proportion is calculated using syntactic words as units.¹² To obtain syntactic words and calculate this splitting, sentences were segmented and tokenized by hand; this gold standard was then processed with the combined Stanza model.¹³ Following this division, the test set is made of 324 sentences.

Table 1 shows the sentence-ending punctuation marks in the test set. Both the total number of occurrences (TOTAL) and the number of times a sign is an end-of-sentence marker (EOS) are reported. In addition to the full stop, sentence boundaries can be indicated by expressive punctuation marks (!, ?) when followed by a capital letter. If followed by a lowercase letter, instead, these marks only have an expressive role, modifying the sentence’s internal intonation without determining its end. Low quotation marks («») and long dashes (–), used for direct speech and thoughts respectively, typically determine a sentence boundary when they appear with another demarcative punctuation mark (e.g., a full stop). In Manzoni’s novel, if a closing quotation mark (guillemets or long dashes) appears with another punctuation mark, the latter is usually placed before the former, which formally closes the sentence. Lastly, in the novel, suspension points (...) can indicate a sentence boundary when they suggest a suspensive allusion or when they mark the interruption of a character’s line due to linguistic or extra-linguistic contingencies. In such cases, the demarcative function of suspension points is shown either by the following capital letter or by an opening quotation mark which indicates the beginning of a different character’s line.

¹¹ The text, fully digitized and available online, was collated with the reference edition [29] prior to analysis, to ensure maximum fidelity to the author’s punctuation choices.
¹² https://universaldependencies.org/release_checklist.html#data-split
¹³ The output of this process was used to train a new Stanza model as reported in Section 6.

5. Results of the Evaluation

Table 2 reports the results of our evaluation in terms of F1. The best performance (0.94) is registered with Sentence Splitter, a rule-based system. All other tools do not exceed 0.70, thus having significantly lower performances than those reported on contemporary Italian texts. For example, the official result of UDPipe 2 on the VIT treebank with the 2.12 model starting from raw text is 0.95, that is almost 30 points more than what is obtained on our test set. The lowest result (0.51) is obtained by the unsupervised WtP system. Although the rule-based approach seems to be the most promising, only Sentence Splitter has an excellent result even without any adaptation of the existing rules.

The errors made by the tools include the following:

1. Misinterpretation of guillemets («,»). The closing sign of the low quotation marks is not recognized as a sentence boundary, so in the automatic segmentation it can appear at the beginning or in the middle of a sentence.
2. In supervised systems semicolons and colons are sometimes considered as sentence boundary signals. Indeed, in the VIT treebank and in those used to train the combined Stanza model, sentences are segmented inconsistently: sometimes semicolons and colons are strong punctuation, and sometimes not.
3. Suspension points are always considered strong punctuation marks and the sentence is split after them.
4. A sentence is often split after an expressive punctuation mark (?, !) even if it is followed by a lowercase letter.
5. The long dash is not recognized as a sentence-ending marker; consequently, either the sentence continues after the dash or the dash appears at the beginning of the following sentence.

6. Training a New Stanza Model

With the rest of the manually split data, namely 2,447 sentences for the training set and 324 for the development set, a new Stanza model specific for Manzoni’s text was trained. Different amounts of sentences were used for training in order to control the effect of the dataset size on the performance. The results obtained with 1500 steps are the following:

• 300 sentences: 0.97 F1
• 1,000 sentences: 0.98 F1
• 2,447 sentences: 0.99 F1

With just 300 sentences there is already a clear improvement over the default model, obtaining an even higher result than the one obtained with Sentence Splitter, the system that had proven to be the best on our test set.

7. What About Other Novels?

¹⁴ The reference edition text was used for the analysis of these novels too.
¹⁵ 86 sentences are taken from “I Malavoglia”, corresponding to the first chapter of the novel; 93 sentences, that is the first two chapters, come from “Le avventure di Pinocchio”; 87 sentences are taken from “Cuore”, corresponding to the first three chapters of the novel.

Table 4
Results on about 90 sentences taken from other 19th-century novels. Stanza retr. refers to the model retrained on Manzoni’s novel, as described in Section 6.

                    Malavoglia   Pinocchio   Cuore
spaCy               0.73         0.35        0.84
CoreNLP ssplit      0.76         0.72        0.62
Sentence Splitter   0.77         0.45        0.68
UDPipe              0.75         0.79        0.67
Stanza              0.71         0.70        0.61
Stanza retr.        0.90         0.89        0.69
Ersatz              0.72         0.75        0.66
Punkt               0.73         0.77        0.66
WtP                 0.53         0.78        0.39

The results obtained are once again lower than those reported for contemporary texts, but the model retrained on “I Promessi Sposi” shows improved performance for all novels, especially when applied to “I Malavoglia” and to “Le avventure di Pinocchio” (+19 points with respect to the default Stanza combined model in both cases); the improvement is more limited for “Cuore” (+8 points).

The rule-based approach is promising but with different systems (spaCy for “Cuore” and ssplit for “I Malavoglia”). Instead, the VIT model of UDPipe, and therefore a supervised approach, is the best on “Le avventure di Pinocchio”. Some tools obtain extremely different results depending on the text they process. spaCy and Sentence Splitter record a very low result on “Le avventure di Pinocchio” (0.35 and 0.45 respectively) while WtP has an F1 of only 0.39 on “Cuore”, half of what it achieved on “Le avventure di Pinocchio”.

This diversified situation is principally due to the fact that each novel presents unique characteristics, even in punctuation.

“I Malavoglia” is a choral novel in which the various styles of speech of the characters and the narrative voice are mixed together. Punctuation marks largely represent this mixture. Indeed, among the main peculiarities of the novel is the original and personal use of quotation marks. For example, guillemets («,») are frequently used to refer to popular sayings and proverbs as well as to short formulas [33], which sometimes intersperse the diegesis, whether introduced by colons or not, and sometimes isolate a complete enunciative section. The long dash (–), instead, has a number of different functions [34]: one of these is to signal direct speech, but often marking only its beginning and not its end. This leads, on one hand, to a variety of ways of handling parenthetical elements and, on the other hand, to a blurred boundary between the characters’ speech, the characters’ speech mediated by the narrator, and the narrator’s own discourse.

“Pinocchio”, a novel written for a young audience, is characterized by a strongly dialogic style [35]. For direct speech, including the simulated dialogue between the narrator and the reader, the long dash (–) is abundantly used, but as in “I Malavoglia”, the opening dashes are not always accompanied by the closing ones. Additionally, Collodi frequently uses punctuation clusters, specifically the exclamation mark followed by suspension points (!...), at the end of sentences [36], a possibility mostly not contemplated by late 19th-century grammars.

Lastly, Edmondo de Amicis’s novel “Cuore” tells the story of a child’s school experience from his point of view, adopting a diary-like structure. In “Cuore”, the linguistic form is simple and plain: the sentences are mainly short and often end with a standard strong punctuation mark, followed by a capital letter. Direct speech is clearly indicated by long dashes (–), but successive lines of dialogue are arranged consecutively on the page, and in such cases, the closing dash of the previous line also serves as the opening dash of the next line. Since the lines of dialogue are perfectly integrated into the narrative structure, they can end with various punctuation marks, from commas to semicolons to full stops. When the punctuation mark is not strong, after the preliminary conclusion of the line, the text continues with the narrator’s discourse.

Beyond the specific differences listed schematically above, there are also some common typographical and punctuation features among the considered novels. For example, when a closing quotation mark appears with another punctuation mark, the latter in general occurs before the former, as found in “I Promessi Sposi”.
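The F1 values discussed in this paper compare automatically detected sentence boundaries with the manually split ones. The scoring script is not specified in the text, so the following is only a minimal sketch under a stated assumption: a predicted boundary counts as correct when it falls at the same position as a gold boundary, with positions measured in non-whitespace characters. The function names `boundary_offsets` and `boundary_f1` are introduced here for illustration.

```python
def boundary_offsets(sentences: list[str]) -> set[int]:
    """End position of each sentence, counted in non-whitespace characters."""
    offsets: set[int] = set()
    position = 0
    for sentence in sentences:
        position += sum(1 for ch in sentence if not ch.isspace())
        offsets.add(position)
    return offsets

def boundary_f1(gold: list[str], predicted: list[str]) -> float:
    """F1 over sentence-end positions (assumed matching criterion)."""
    gold_b = boundary_offsets(gold)
    pred_b = boundary_offsets(predicted)
    true_positives = len(gold_b & pred_b)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred_b)
    recall = true_positives / len(gold_b)
    return 2 * precision * recall / (precision + recall)

# Toy data (invented, not taken from the novels): the system misses the
# boundary after "C d!", so precision is 1.0 and recall is 2/3.
gold = ["A b.", "C d!", "E f."]
predicted = ["A b.", "C d! E f."]
score = boundary_f1(gold, predicted)
print(round(score, 2))
```

Counting only non-whitespace characters keeps the comparison independent of how each tool normalizes spaces. Note that under this criterion the final boundary of a text is always matched, which slightly inflates scores on very short samples.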

8. Conclusions

This paper presents an assessment of the performance of eight sentence splitting tools adopting different approaches on four 19th-century novels: “I Promessi Sposi” by Alessandro Manzoni, “I Malavoglia” by Giovanni Verga, “Le avventure di Pinocchio” by Carlo Collodi, and “Cuore” by Edmondo de Amicis. Although these texts belong to the same historical period, they show specific features depending on the form and content of the novel as well as the author’s stylistic choices. Among these features is punctuation, which in the late 19th century had not yet reached a detectable stability and was rather experiencing a paradigmatic change.

Since sentence splitting for Western languages, including Italian, relies heavily on punctuation disambiguation, applying existing tools to the four novels considered has resulted in performances well below the standards. These texts demonstrate that sentence splitting is not a completely solved task.
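As a rough illustration of what punctuation disambiguation involves for these texts, the sketch below encodes the conventions described in Section 4: a full stop, an expressive mark (!, ?) or suspension points, optionally followed by a closing guillemet, ends a sentence only when the next character is a capital letter or opens a new line of dialogue. This is a simplified assumption-laden sketch, not any of the evaluated systems; the name `split_sentences` and the example strings are invented for illustration.

```python
import re

# Candidate boundary: "...", ".", "!" or "?", an optional closing guillemet,
# then whitespace. Whether it is a real boundary depends on what follows.
CANDIDATE = re.compile(r"(?:\.\.\.|[.!?])»?\s+")

def split_sentences(text: str) -> list[str]:
    """Split at candidates followed by an uppercase letter, « or a long dash."""
    sentences: list[str] = []
    start = 0
    for match in CANDIDATE.finditer(text):
        following = text[match.end():match.end() + 1]
        if following.isupper() or following in ("«", "–"):
            sentences.append(text[start:match.end()].strip())
            start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences

# Invented examples (not quotations from the novels).
print(split_sentences("Ah! che dite? Non lo so."))
print(split_sentences("«Va bene.» Renzo uscì."))
```

In the first example the exclamation mark is followed by a lowercase letter and is therefore treated as sentence-internal, while the question mark followed by a capital letter closes the sentence. The rule deliberately ignores colons, semicolons and commas, so a quotation followed by its diegetic frame stays in one piece.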

On the other hand, applying the model retrained on “I Promessi Sposi” to the other three novels showed significant improvements for “Le avventure di Pinocchio” and “I Malavoglia”, and a moderate improvement for “Cuore”. This result suggests that shared historical context and belonging to the same textual genre may offer sufficient similarities to improve the model’s performance. However, the example of “Cuore” is evidence of how this is sometimes not enough: some specific features in form, punctuation and style continue to affect sentence splitting, demonstrating that although retraining may mitigate some problems, it does not completely overcome the inherent variability of these texts.

Philologists have increasingly focused on preserving the original punctuation as a part of the author’s creation of the text, providing valuable and reliable supports of study for scholars of linguistics and the history of the Italian language. Their combined knowledge is precious for achieving accurate sentence splitting in these texts. Thus, sentence splitting can be an interesting common ground between different disciplines, potentially leading to the development of tools for the automatic analysis of historical literary texts. This field remains under-explored in the Italian context, offering significant opportunities for further study and cross-disciplinary collaboration.

Acknowledgments

This publication was produced by a researcher with a research contract co-funded by the European Union - PON Ricerca e Innovazione 2014-2020, pursuant to Art. 24, paragraph 3, letter a, of Law no. 240 of 30 December 2010, as amended, and Ministerial Decree no. 1062 of 10 August 2021.

References

[1] Bonomi, Masini, Morgana, Piotti, et al., Elementi di linguistica italiana, volume 103, Carocci, 2010.
[2] D. D. Palmer, Chapter 2: Tokenisation and sentence segmentation, Handbook of natural language processing (2007).
[3] Dridan, Oepen, Document parsing: Towards realistic syntactic analysis, in: Proceedings of The 13th International Conference on Parsing Technologies (IWPT 2013), 2013, pp. 127–133.
[4] Wicks, Post, Does sentence segmentation matter for machine translation?, in: Proceedings of the Seventh Conference on Machine Translation (WMT), 2022, pp. 843–854.
[5] Liu, Xie, Impact of automatic sentence segmentation on meeting summarization, in: 2008 IEEE International Conference on Acoustics, Speech and Signal Processing, IEEE, 2008, pp. 5009–5012.
[6] Qi, Zhang, Bolton, C. D. Manning, Stanza: A Python natural language processing toolkit for many human languages, in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, 2020. URL: https://nlp.stanford.edu/pubs/qi2020stanza.pdf.
[7] Bosco, Montemagni, Simi, et al., Converting Italian Treebanks: Towards an Italian Stanford Dependency Treebank, in: Proceedings of the 7th Linguistic Annotation Workshop and Interoperability with Discourse, The Association for Computational Linguistics, 2013, pp. 61–69.
[8] Sanguinetti, Bosco, Lavelli, Mazzei, Antonelli, Tamburini, PoSTWITA-UD: an Italian Twitter treebank in Universal Dependencies, in: N. Calzolari, K. Choukri, C. Cieri, T. Declerck, S. Goggi, K. Hasida, H. Isahara, B. Maegaard, J. Mariani, H. Mazo, A. Moreno, J. Odijk, S. Piperidis, T. Tokunaga (Eds.), Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), European Language Resources Association (ELRA), Miyazaki, Japan, 2018. URL: https://aclanthology.org/L18-1279.
[9] Tonani, Premessa. Tra punteggiatura e tipografia, in: E. Tonani (Ed.), Il romanzo in bianco e nero. Ricerche sull'uso degli spazi bianchi e dell'interpunzione nella narrativa italiana dall'Ottocento a oggi, Franco Cesati, Firenze, 2010, pp. 13–28.
[10] Ferrari, Punteggiatura, in: G. Antonelli, Motolese, L. Tomasi (Eds.), Storia dell'italiano scritto. Grammatiche, volume IV, Carocci, Roma, 2018, pp. 169–202.
[11] Mortara Garavelli, Prontuario di punteggiatura, Laterza, Bari, 2003.
[12] A. Manzoni, F. Ghisalberti, A. Chiari, L’ultima revisione dei Promessi Sposi, in: Tutte le opere di Alessandro Manzoni. I Promessi Sposi, volume II, Mondadori, Milano, 1954, pp. 789–989.
[13] M. Straka, UDPipe 2.0 prototype at CoNLL 2018 UD shared task, in: D. Zeman, J. Hajič (Eds.), Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, Association for Computational Linguistics, Brussels, Belgium, 2018, pp. 197–207. URL: https://aclanthology.org/K18-2020. doi:10.18653/v1/K18-2020.
[14] M.-C. De Marneffe, C. D. Manning, J. Nivre, D. Zeman, Universal Dependencies, Computational linguistics 47 (2021) 255–308.
[15] T. Kiss, J. Strunk, Unsupervised multilingual sentence boundary detection, Computational Linguistics 32 (2006) 485–525. URL: https://aclanthology.org/J06-4003. doi:10.1162/coli.2006.32.4.485.
[16] Y. Liu, A. Stolcke, E. Shriberg, M. Harper, Using conditional random fields for sentence boundary detection in speech, in: Proceedings of the 43rd annual meeting of the Association for Computational Linguistics (ACL’05), 2005, pp. 451–458.
[17] R. Sheik, T. Gokul, S. Nirmala, Efficient deep learning-based sentence boundary detection in legal text, in: Proceedings of the Natural Legal Language Processing Workshop 2022, 2022, pp. 208–217.
[18] D. Rudrapal, A. Jamatia, K. Chakma, A. Das, B. Gambäck, Sentence boundary detection for social media text, in: Proceedings of the 12th International Conference on Natural Language Processing, 2015, pp. 254–260.
[19] A. A. Azzi, H. Bouamor, S. Ferradans, The FinSBD-2019 shared task: Sentence boundary detection in PDF noisy text in the financial domain, in: C.-C. Chen, H.-H. Huang, H. Takamura, H.-H. Chen (Eds.), Proceedings of the First Workshop on Financial Technology and Natural Language Processing, Macao, China, 2019, pp. 74–80. URL: https://aclanthology.org/W19-5512.
[20] T. Brugger, M. Stürmer, J. Niklaus, MultiLegalSBD: a multilingual legal sentence boundary detection dataset, in: Proceedings of the Nineteenth International Conference on Artificial Intelligence and Law, 2023, pp. 42–51.
[21] J. Read, R. Dridan, S. Oepen, L. J. Solberg, Sentence boundary detection: A long solved problem?, in: M. Kay, C. Boitet (Eds.), Proceedings of COLING 2012: Posters, The COLING 2012 Organizing Committee, Mumbai, India, 2012, pp. 985–994. URL: https://aclanthology.org/C12-2096.
[22] A. Ferrari, L. Lala, F. Longo, F. Pecorari, B. Rosi, R. Stojmenova, La punteggiatura italiana contemporanea. Un’analisi comunicativo-testuale, Carocci, Roma, 2018.
[23] C. D. Manning, M. Surdeanu, J. Bauer, J. R. Finkel, S. Bethard, D. McClosky, The Stanford CoreNLP natural language processing toolkit, in: Proceedings of 52nd annual meeting of the association for computational linguistics: system demonstrations, 2014, pp. 55–60.
[24] P. Koehn, Europarl: A parallel corpus for statistical machine translation, in: Proceedings of Machine Translation Summit X: Papers, Phuket, Thailand, 2005, pp. 79–86. URL: https://aclanthology.org/2005.mtsummit-papers.11.
[25] R. Delmonte, A. Bristot, S. Tonelli, VIT-Venice Italian Treebank: Syntactic and quantitative features, in: Sixth International Workshop on Treebanks and Linguistic Theories, volume 1, Northern European Association for Language Technol, 2007, pp. 43–54.
[26] R. Wicks, M. Post, A unified approach to sentence segmentation of punctuated text in many languages, in: C. Zong, F. Xia, W. Li, R. Navigli (Eds.), Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Association for Computational Linguistics, Online, 2021, pp. 3995–4007. URL: https://aclanthology.org/2021.acl-long.309. doi:10.18653/v1/2021.acl-long.309.
[27] B. Minixhofer, J. Pfeiffer, I. Vulić, Where’s the point? self-supervised multilingual punctuation-agnostic sentence segmentation, in: A. Rogers, J. Boyd-Graber, N. Okazaki (Eds.), Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Toronto, Canada, 2023, pp. 7215–7235. URL: https://aclanthology.org/2023.acl-long.398. doi:10.18653/v1/2023.acl-long.398.
[28] A. Palmero Aprosio, G. Moretti, Tint 2.0: an all-inclusive suite for NLP in Italian, in: Proceedings of the Fifth Italian Conference on Computational Linguistics (CLiC-it 2018), Accademia University Press, 2018, pp. 311–317.
[29] A. Manzoni, B. Colli, I Promessi Sposi. Edizione genetica della Quarantana, Casa del Manzoni, Milano, 2024.
[30] G. Verga, F. Cecco, I Malavoglia, Fondazione Verga-Interlinea, Catania-Novara, 2014.
[31] C. Collodi, O. Castellani Pollidori, Le avventure di Pinocchio, Fondazione nazionale Carlo Collodi, Pescia, 1983.
[32] E. De Amicis, L. Tamburini, Cuore. Libro per ragazzi, Einaudi, Torino, 2018 (1° ed. 1972).
[33] G. B. Bronzini, Proverbi, discorso e gesto proverbiale nei «Malavoglia», in: I Malavoglia. Atti del Congresso Internazionale di Studi (26-28 novembre 1981), Biblioteca della Fondazione Verga, Catania, 1982, pp. 637–684.
[34] E. Tonani, Il ’bianco di dialogato’ e il trattamento tipografico del discorso diretto, in: E. Tonani (Ed.), Il romanzo in bianco e nero. Ricerche sull’uso degli spazi bianchi e dell’interpunzione nella narrativa italiana dall’Ottocento a oggi, Franco Cesati, Firenze, 2010, pp. 103–136.
[35] R. Pellerey, Pinocchio tra dialogo e scrittura, Belfagor 60 (2005) 267–284. URL: https://www.jstor.org/stable/26150287.
[36] O. Castellani Pollidori, Introduzione, in: C. Collodi, O. Castellani Pollidori (Eds.), Le avventure di Pinocchio, Fondazione nazionale Carlo Collodi, Pescia, 1983, pp. XIII–LXXXIV.