Is Sentence Splitting a Solved Task? Experiments to the Intersection Between NLP and Italian Linguistics†

Arianna Redaelli¹, Rachele Sprugnoli¹,*
¹ Università di Parma, Via D’Azeglio, 85, 43125 Parma, Italy

Abstract
Sentence splitting, that is the segmentation of the raw input text into sentences, is a fundamental step in text processing. Although it is considered a solved task for texts such as news articles and Wikipedia pages, the performance of systems can vary greatly depending on the text genre. This paper presents the evaluation of the performance of eight sentence splitting tools adopting different approaches (rule-based, supervised, semi-supervised, and unsupervised learning) on Italian 19th-century novels, a genre that has not received sufficient attention so far but which can be an interesting common ground between Natural Language Processing and Digital Humanities.

Keywords
sentence splitting, text segmentation, literary texts, Italian

CLiC-it 2024: Tenth Italian Conference on Computational Linguistics, Dec 04-06, 2024, Pisa, Italy
* Corresponding author.
† This paper is the result of the collaboration between the two authors. For the specific concerns of the Italian academic attribution system: Rachele Sprugnoli is responsible for Sections 2, 3, 6; Arianna Redaelli is responsible for Sections 1, 4, 8. Section 7 was collaboratively written by the two authors.
arianna.redaelli@unipr.it (A. Redaelli); rachele.sprugnoli@unipr.it (R. Sprugnoli)
ORCID: 0000-0001-6374-9033 (A. Redaelli); 0000-0001-6861-5595 (R. Sprugnoli)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073

1. Introduction

Sentence splitting is the process of segmenting a text into sentences¹ by detecting their boundaries, which, at least for Western languages, including Italian, usually correspond to certain punctuation marks [2]. This means that sentence splitting, for many languages, is a matter of punctuation disambiguation, that is, recognizing whether or not a punctuation mark signals a sentence boundary. The importance of sentence splitting is often underestimated because it is considered an easy task, but its quality has a strong impact on the quality of subsequent text processing, because errors can propagate and reduce the performance of downstream tasks such as Syntactic Analysis [3], Machine Translation [4] and Automatic Summarization [5].

The most popular pipeline models, such as those of Stanza [6] and spaCy², have mostly been trained and evaluated on fairly formal texts, such as news articles and Wikipedia pages, so the publicly reported performances tend to be high, i.e. above 0.90 in terms of F1. However, the text genre has a significant impact on the results. For example, in the CoNLL 2018 shared task “Multilingual Parsing from Raw Text to Universal Dependencies”, the best system on the Italian ISDT treebank [7] achieved an F1 of 0.99, while on the PoSTWITA treebank, made of tweets [8], the highest result was 0.66.

Given these variations, considering less formal text genres could provide valuable insights into the challenges of sentence splitting. Among these genres are literary texts, which present unique and peculiar stylistic and creative features that can break traditional grammatical norms, including punctuation ones [9]. These features depend on both authorial choices and the cultural context of the time. As a matter of fact, punctuation can vary significantly depending on the historical period; literary texts may follow prevailing trends or oppose them, giving rise to new trends. This phenomenon is particularly evident in the 19th century, when the Italian usus punctandi began shifting from a primarily syntactic usage, prescribed by grammar books, to a communicative-textual usage of punctuation marks [10]. Since this shift was probably influenced by the reflections and the practical uses of prominent authors such as Alessandro Manzoni [11], our study focuses on his historical novel, “I Promessi Sposi”. The author paid meticulous attention to the punctuation of the text, revising it up to the final print proofs, and made specific and personal choices in collaboration with the publisher, alongside more classical ones [12]. Although not always consistent, Manzoni’s decisions make the novel particularly complex and interesting from a punctuation perspective. Furthermore, “I Promessi Sposi” has been a fundamental reference for the development of a common written Italian language: starting from this assumption, many of the author’s punctuation choices have been adopted by later grammars for rule-making, though only some of them have become part of the standard.

Given that punctuation was still undergoing standardization at the time, and that its use can depend not only on the conventions of the period but also on the writer’s personal style, the type of content being addressed (and how it is presented), and even the influence of typography during the printing process, we also decided to broaden our study to include sections from other novels contemporary to Manzoni’s (1840-42). Specifically, we analyzed “I Malavoglia” (1881) by Giovanni Verga, “Le avventure di Pinocchio. Storia di un burattino” (1883) by Carlo Collodi, and “Cuore” (1886) by Edmondo de Amicis.

In this paper, our main contributions are as follows: (i) we provide an estimate of the performance of eight sentence splitting tools adopting different approaches on a specific and challenging text genre, namely historical literary fiction texts, which has not received enough attention so far; (ii) we compare the results considering the point of view of humanities scholars (in particular Italian linguistics) as the main stakeholders in the considered domain, in order to establish a flourishing cross-fertilization between NLP and Digital Humanities; (iii) we release manually split data for four 19th-century Italian novels and a shared notebook in which many of the tested systems can be run.³

¹ By “sentence” we mean a coherent set of words constructed according to the general rules of the language, conveying a complete thought that makes sense on its own [1]. A sentence ends with a strong punctuation mark (e.g., full stop, question mark, or exclamation point) and is typically followed by a capital letter. The definition of sentence adopted here, which like any definition is inherently problematic, is motivated by the specific requirements of the present work, as will be seen below.
² https://spacy.io
³ https://github.com/RacheleSprugnoli/Sentence_Splitting_Manzoni

2. Related Work

Sentence splitting systems can be categorized into three macro-classes based on the approach used to develop them. There are rule-based systems, such as Sentence Splitter⁴ and the Sentencizer module of spaCy, that use heuristics specific to the various languages and lists of exceptions and abbreviations. Then, there are supervised systems, which need to be trained on datasets in which sentences are already correctly segmented. For example, UDPipe [13] and Stanza are trained on Universal Dependencies (UD) treebanks [14]. Finally, unsupervised systems are trained on datasets of non-segmented texts, taking advantage of features such as the length of words and collocational information. An example is given by Punkt, available as a module within the NLTK (Natural Language Toolkit) library [15]. In our work, we test these various approaches on a benchmark dataset of historical literary fiction texts by evaluating the performance of eight different systems.

There are several studies that analyze the impact of text genre on sentence splitting, but literary texts are rarely considered. For example, Liu et al. [16] work on speech transcriptions, Sheik et al. [17] on legal texts, and Rudrapal et al. [18] on social media posts. Moreover, a shared task on sentence boundary detection in the financial domain (FinSBD) was organized in 2019, 2020 and 2021 [19].

Most of the available studies concern the processing of English texts, while Italian is usually not included in the evaluation. An interesting exception is given by a work on multilingual legal texts that contains a detailed evaluation of the results on Italian documents [20].

Our work draws inspiration from the assessment on English texts provided by Read et al. [21], which includes, among others, the Sherlock Holmes stories, but moves to the Italian context. Furthermore, we focus on the literary context, showing how 19th-century novels are a challenge for current sentence splitting systems.

⁴ https://github.com/mediacloud/sentence-splitter

3. Tools

Sentence splitting is a fundamental analysis in text processing, for which there are many tools available, also for Italian. For our evaluation we have selected eight tools developed with different approaches. Some tools are modules integrated in larger pipelines, others are systems specifically created to perform only sentence splitting. It is important to note that the selected tools do not split in the presence of a colon or semicolon. Indeed, although recent studies in the punctuation field identify colons and semicolons as punctuation marks capable of indicating the boundary of a sentence [22], as anticipated in footnote 1, in this work we have decided not to consider them as separating marks because of the various forms literary texts can take. To clarify the issue, we can consider the example of direct speech. In “I Promessi Sposi”, direct speech can be introduced by a verbum dicendi and a colon, continuing without any interruption. In such cases, splitting at the colon would be relatively easy. However, direct speech can also be embedded within a sentence that continues after the quotation closes, creating a non-autonomous text portion that, during sentence splitting, should be manually reconnected to the one preceding the quotation itself (e.g., Lucia sospirò, e ripeté: «coraggio,» con una voce che smentiva la parola. EN: Lucia sighed, and repeated, «courage,» in a voice that belied the word.). An equally troublesome problem arises when the diegetic frame follows the quotation instead of preceding it. When this happens, the colon is absent, and other punctuation marks like commas are found before the closing quotation mark or dash (e.g., «È il mio caso,» disse Renzo. EN: «That’s my case,» said Renzo.). The system would not split the sentences at these punctuation marks, yet the diegetic frame following the direct speech has the same value and autonomy as the one preceding it. Consequently, considering colons and semicolons as sentence boundaries would make the segmentation much more complex and often inaccurate.

The selected tools are the following:

• CoreNLP⁵: an NLP pipeline written in Java and developed by Stanford University [23]. It contains various modules including ssplit, which divides a text into sentences via a set of rules. The latest version of the pipeline (4.5.7) supports eight languages including Italian.
• spaCy: an open-source NLP library which supports dozens of languages, including Italian, and provides four alternatives for sentence splitting. Among these, statistical models for Italian have been trained to split on colons and semicolons. For this reason, we tested the performance only of Sentencizer, the rule-based pipeline component.
• Sentence Splitter⁶: a Python module based on scripts developed for processing the Europarl corpus [24]. It supports several languages with ad-hoc rules.
• UDPipe⁷: an NLP pipeline based on the UD framework performing tokenization, sentence splitting, PoS tagging, lemmatization and syntactic analysis. UDPipe 2 is written in Python and uses the tokenizer of UDPipe 1; among the 131 most recent models (version 2.12), seven are for Italian. We evaluated the model trained on the VIT treebank [25], which does not (always) split at colons and semicolons.
• Stanza⁸: an NLP package written in Python and based on neural network components. Sentence splitting is jointly performed with tokenization by the TokenizeProcessor module. The default Italian model is a combination of multiple UD treebanks.
• Ersatz⁹: a language-agnostic neural model based on a semi-supervised training paradigm. It combines the use of regular expressions to detect candidate sentence boundaries with a Transformer-based binary classifier [26].
• Punkt: an unsupervised system which uses collocational information to identify abbreviations, initials, and ordinal numbers. All punctuation not included in these elements is considered an end-of-sentence marker.
• WtP¹⁰: an unsupervised multilingual sentence segmentation system based on a self-supervised learning approach tested on 85 languages, including Italian. It does not rely on punctuation or sentence-segmented training data, thus it is a punctuation-agnostic system [27]. Among the various available models, we adopted wtp-canine-s-12l which, according to the official documentation of the tool, has the best results on languages other than English.

For the evaluation, the tools were used as they are, with their default configurations and without any customization. For this reason, given the choices motivated above, we did not consider other systems, such as Tint [28], which by default split at colons and semicolons.

⁵ https://stanfordnlp.github.io/CoreNLP/
⁶ https://github.com/mediacloud/sentence-splitter
⁷ https://ufal.mff.cuni.cz/udpipe
⁸ https://stanfordnlp.github.io/stanza/
⁹ https://github.com/rewicks/ersatz
¹⁰ https://github.com/segment-any-text/wtpsplit

4. Dataset

The data used to evaluate the aforementioned tools are taken from “I Promessi Sposi” in its final version published in 1840-1842.¹¹ 3,095 sentences, corresponding to 12 chapters of the novel, were manually split. This dataset was divided into training, development and test sets in the proportions 80/10/10, following the UD rules, according to which the proportion is calculated using syntactic words as units.¹² To obtain syntactic words and calculate this split, sentences were segmented and tokenized by hand; this gold standard was then processed with the combined Stanza model.¹³ Following this division, the test set is made of 324 sentences.

Table 1 shows the sentence-ending punctuation marks in the test set. Both the total number of occurrences (TOTAL) and the number of times a sign is an end-of-sentence marker (EOS) are reported. In addition to the full stop, sentence boundaries can be indicated by expressive punctuation marks (!, ?) when followed by a capital letter. If followed by a lowercase letter, instead, these marks only have an expressive role, modifying the sentence’s internal intonation without determining its end. Low quotation marks («») and long dashes (–), used for direct speech and thoughts respectively, typically determine a sentence boundary when they appear with another demarcative punctuation mark (e.g., a full stop). In Manzoni’s novel, if a closing quotation mark (guillemet or long dash) appears with another punctuation mark, the latter is usually placed before the former, which formally closes the sentence. Lastly, in the novel, suspension points (...) can indicate a sentence boundary when they suggest a suspensive allusion or when they mark the interruption of a character’s line due to linguistic or extra-linguistic contingencies. In such cases, the demarcative function of the suspension points is shown either by the following capital letter or by an opening quotation mark which indicates the beginning of a different character’s line.

Table 1
End-of-sentence markers in the test set.

MARK   # TOTAL   # EOS
.      277       237
»      90        53
?      47        22
!      31        6
...    23        3
–      10        3

¹¹ The text, fully digitized and available online, was collated with the reference edition [29] prior to analysis, to ensure maximum fidelity to the author’s punctuation choices.
¹² https://universaldependencies.org/release_checklist.html#data-split
¹³ The output of this process was used to train a new Stanza model as reported in Section 6.

5. Results of the Evaluation

Table 2 reports the results of our evaluation in terms of F1. The best performance (0.94) is registered with Sentence Splitter, a rule-based system. All other tools do not exceed 0.70, thus having significantly lower performances than those reported on contemporary Italian texts. For example, the official result of UDPipe 2 on the VIT treebank with the 2.12 model starting from raw text is 0.95, that is almost 30 points more than what is obtained on our test set. The lowest result (0.51) is obtained by the unsupervised WtP system. Although the rule-based approach seems to be the most promising, only Sentence Splitter has an excellent result even without any adaptation of the existing rules.

Table 2
Results (in terms of F1) of eight systems developed with different approaches: rule-based (RB), supervised (S), semi-supervised (SS) and unsupervised learning (U).

TYPE   SYSTEM                  F1
RB     spaCy sentencizer       0.61
RB     CoreNLP 4.5.7 ssplit    0.66
RB     SentenceSplitter        0.94
S      UDPipe 2 VIT model      0.66
S      Stanza combined         0.69
SS     Ersatz                  0.60
U      Punkt                   0.68
U      WtP wtp-canine-s-12l    0.51

Analyzing the outputs of the various systems, it is possible to notice some recurring errors (a few examples are reported in Table 3):

1. Misinterpretation of guillemets («,»). The closing sign of the low quotation marks is not recognized as a sentence boundary, so in the automatic segmentation it can appear at the beginning or in the middle of a sentence.
2. In supervised systems semicolons and colons are sometimes considered as sentence boundary signals. Indeed, in the VIT treebank and in those used to train the combined Stanza model, sentences are segmented inconsistently: sometimes semicolons and colons are strong punctuation, and sometimes not.
3. Suspension points are always considered strong punctuation marks and the sentence is split after them.
4. A sentence is often split after an expressive punctuation mark (?, !) even if it is followed by a lowercase letter.
5. The long dash is not recognized as a sentence-ending marker; consequently, either the sentence continues after the dash or the dash appears at the beginning of the following sentence.

Table 3
Examples of errors in two of the tested systems compared with the manually split sentences. Numbers mark the sentences produced by each segmentation.

Example 1
  Test gold:            1) «Al sagrestano gli crede?»  2) «Perché?»
  UDPipe 2 VIT model:   1) » «Al sagrestano gli crede?  2) » «Perché?
  Ersatz:               1) » «Al sagrestano gli crede?» «Perché?»

Example 2
  Test gold:            1) – È lei, di certo!–  2) Era proprio lei, con la buona vedova.
  UDPipe 2 VIT model:   1) – È lei, di certo!– Era proprio lei, con la buona vedova.
  Ersatz:               1) – È lei, di certo!  2) – Era proprio lei, con la buona vedova.

Example 3
  Test gold:            1) Anche Agnese, veda; anche Agnese...»  2) «Uh! ha voglia di scherzare, lei,» disse questa.
  UDPipe 2 VIT model:   1) Anche Agnese, veda; anche Agnese...» «Uh! ha voglia di scherzare, lei,» disse questa.
  Ersatz:               1) Anche Agnese, veda; anche Agnese...» «Uh!  2) ha voglia di scherzare, lei,» disse questa. «

6. Training a New Stanza Model

With the rest of the manually split data, namely 2,447 sentences for the training set and 324 for the development set, a new Stanza model specific for Manzoni’s text was trained. Different amounts of training sentences were used in order to check the effect of the dataset size on the performance. The results obtained with 1,500 steps are the following:

• 300 sentences: 0.97 F1
• 1,000 sentences: 0.98 F1
• 2,447 sentences: 0.99 F1

With just 300 sentences there is already a clear improvement over the default model, with a result even higher than the one obtained with Sentence Splitter, the system that had proven to be the best on our test set.

7. What About Other Novels?

Table 4 displays the performance of the same systems tested on “I Promessi Sposi” on the first approximately 90 sentences of three other important 19th-century novels:¹⁴ “I Malavoglia” (1881) by Giovanni Verga [30], “Le avventure di Pinocchio. Storia di un burattino” (1883) by Carlo Collodi [31], and “Cuore” (1886) by Edmondo de Amicis [32].¹⁵

Table 4
Results on about 90 sentences taken from other 19th-century novels. Stanza retr. refers to the model retrained on Manzoni’s novel, as described in Section 6.

SYSTEM           Malavoglia   Pinocchio   Cuore
spaCy            0.73         0.35        0.84
CoreNLP ssplit   0.76         0.72        0.62
SentenceSplit.   0.77         0.45        0.68
UDPipe           0.75         0.79        0.67
Stanza           0.71         0.70        0.61
Stanza retr.     0.90         0.89        0.69
Ersatz           0.72         0.75        0.66
Punkt            0.73         0.77        0.66
WtP              0.53         0.78        0.39

The results obtained are once again lower than those reported for contemporary texts, but the model retrained on “I Promessi Sposi” shows improved performance for all novels, especially when applied to “I Malavoglia” and to “Le avventure di Pinocchio” (+19 points with respect to the default Stanza combined model in both cases); the improvement is more limited for “Cuore” (+8 points). The rule-based approach is promising but with different systems (spaCy for “Cuore” and ssplit for “I Malavoglia”). Instead, the VIT model of UDPipe, and therefore a supervised approach, is the best on “Le avventure di Pinocchio”. Some tools obtain extremely different results depending on the text they process. spaCy and Sentence Splitter record a very low result on “Le avventure di Pinocchio” (0.35 and 0.45 respectively), while WtP has an F1 of only 0.39 on “Cuore”, half of what it achieved on “Le avventure di Pinocchio”.

This diversified situation is principally due to the fact that each novel presents unique characteristics, even in punctuation.

“I Malavoglia” is a choral novel in which the various styles of speech of the characters and the narrative voice are mixed together. Punctuation marks largely represent this mixture. Indeed, among the main peculiarities of the novel is the original and personal use of quotation marks. For example, guillemets («,») are frequently used to refer to popular sayings and proverbs as well as to short formulas [33], which sometimes intersperse the diegesis, whether introduced by colons or not, and sometimes isolate a complete enunciative section. The long dash (–), instead, has a number of different functions [34]: one of these is to signal direct speech, but often marking only its beginning and not its end. This leads, on one hand, to a variety of ways of handling parenthetical elements and, on the other hand, to a blurred boundary between the characters’ speech, the characters’ speech mediated by the narrator, and the narrator’s own discourse.

“Pinocchio”, a novel written for a young audience, is characterized by a strongly dialogic style [35]. For direct speech, including the simulated dialogue between the narrator and the reader, the long dash (–) is abundantly used but, as in “I Malavoglia”, the opening dashes are not always accompanied by the closing ones. Additionally, Collodi frequently uses punctuation clusters, specifically the exclamation mark followed by suspension points (!...), at the end of sentences [36], a possibility mostly not contemplated by late 19th-century grammars.

Lastly, Edmondo de Amicis’s novel “Cuore” tells the story of a child’s school experience from his point of view, adopting a diary-like structure. In “Cuore”, the linguistic form is simple and plain: the sentences are mainly short and often end with a standard strong punctuation mark, followed by a capital letter. Direct speech is clearly indicated by long dashes (–), but successive lines of dialogue are arranged consecutively on the page, and in such cases, the closing dash of the previous line also serves as the opening dash of the next line. Since the lines of dialogue are perfectly integrated into the narrative structure, they can end with various punctuation marks, from commas to semicolons to full stops. When the punctuation mark is not strong, after the preliminary conclusion of the line, the text continues with the narrator’s discourse.

Beyond the specific differences listed schematically above, there are also some common typographical and punctuation features among the considered novels. For example, when a closing quotation mark appears with another punctuation mark, the latter in general occurs before the former, as found in “I Promessi Sposi”.

¹⁴ The reference edition text was used for the analysis of these novels too.
¹⁵ 86 sentences are taken from “I Malavoglia”, corresponding to the first chapter of the novel; 93 sentences, that is the first two chapters, come from “Le avventure di Pinocchio”; 87 sentences are taken from “Cuore”, corresponding to the first three chapters of the novel.

8. Conclusions

This paper presents an assessment of the performance of eight sentence splitting tools adopting different approaches on four 19th-century novels: “I Promessi Sposi” by Alessandro Manzoni, “I Malavoglia” by Giovanni Verga, “Le avventure di Pinocchio” by Carlo Collodi, and “Cuore” by Edmondo de Amicis. Although these texts belong to the same historical period, they show specific features depending on the form and content of the novel as well as the author’s stylistic choices. Among these features is punctuation, which in the late 19th century had not yet reached a detectable stability and was rather experiencing a paradigmatic change.

Since sentence splitting for Western languages, including Italian, relies heavily on punctuation disambiguation, applying existing tools to the four novels considered has resulted in performances well below those reported for contemporary texts. These texts demonstrate that sentence splitting is not a completely solved task.

On the other hand, applying the model retrained on “I Promessi Sposi” to the other three novels showed significant improvements for “Le avventure di Pinocchio” and “I Malavoglia”, and a moderate improvement for “Cuore”. This result suggests that shared historical context and belonging to the same textual genre may offer sufficient similarities to improve the model’s performance. However, the example of “Cuore” is evidence of how this is sometimes not enough: some specific features in form, punctuation and style continue to affect sentence splitting, demonstrating that although retraining may mitigate some problems, it does not completely overcome the inherent variability of these texts.

Philologists have increasingly focused on preserving the original punctuation as a part of the author’s creation of the text, providing valuable and reliable supports of study for scholars of linguistics and the history of the Italian language. Their combined knowledge is precious for achieving accurate sentence splitting in these texts. Thus, sentence splitting can be an interesting common ground between different disciplines, potentially leading to the development of tools for the automatic analysis of historical literary texts. This field remains under-explored in the Italian context, offering significant opportunities for further study and cross-disciplinary collaboration.

Acknowledgments

This publication was produced by a researcher holding a research contract co-funded by the European Union - PON Ricerca e Innovazione 2014-2020, pursuant to Art. 24, paragraph 3, letter a, of Law no. 240 of 30 December 2010, as amended, and to Ministerial Decree no. 1062 of 10 August 2021.

References

[1] I. Bonomi, A. Masini, S. Morgana, M. Piotti, et al., Elementi di linguistica italiana, volume 103, Carocci, 2010.
[2] D. D. Palmer, Chapter 2: Tokenisation and sentence segmentation, Handbook of natural language processing (2007).
[3] R. Dridan, S. Oepen, Document parsing: Towards realistic syntactic analysis, in: Proceedings of The 13th International Conference on Parsing Technologies (IWPT 2013), 2013, pp. 127–133.
[4] R. Wicks, M. Post, Does sentence segmentation matter for machine translation?, in: Proceedings of the Seventh Conference on Machine Translation (WMT), 2022, pp. 843–854.
[5] Y. Liu, S. Xie, Impact of automatic sentence segmentation on meeting summarization, in: 2008 IEEE International Conference on Acoustics, Speech and Signal Processing, IEEE, 2008, pp. 5009–5012.
[6] P. Qi, Y. Zhang, Y. Zhang, J. Bolton, C. D. Manning, Stanza: A Python natural language processing toolkit for many human languages, in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, 2020. URL: https://nlp.stanford.edu/pubs/qi2020stanza.pdf.
[7] C. Bosco, S. Montemagni, M. Simi, et al., Converting Italian Treebanks: Towards an Italian Stanford Dependency Treebank, in: Proceedings of the 7th Linguistic Annotation Workshop and Interoperability with Discourse, The Association for Computational Linguistics, 2013, pp. 61–69.
[8] M. Sanguinetti, C. Bosco, A. Lavelli, A. Mazzei, O. Antonelli, F. Tamburini, PoSTWITA-UD: an Italian Twitter treebank in Universal Dependencies, in: N. Calzolari, K. Choukri, C. Cieri, T. Declerck, S. Goggi, K. Hasida, H. Isahara, B. Maegaard, J. Mariani, H. Mazo, A. Moreno, J. Odijk, S. Piperidis, T. Tokunaga (Eds.), Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), European Language Resources Association (ELRA), Miyazaki, Japan, 2018. URL: https://aclanthology.org/L18-1279.
[9] E. Tonani, Premessa. Tra punteggiatura e tipografia, in: E. Tonani (Ed.), Il romanzo in bianco e nero. Ricerche sull’uso degli spazi bianchi e dell’interpunzione nella narrativa italiana dall’Ottocento a oggi, Franco Cesati, Firenze, 2010, pp. 13–28.
[10] A. Ferrari, Punteggiatura, in: G. Antonelli, M. Motolese, L. Tomasi (Eds.), Storia dell’italiano scritto. Grammatiche, volume IV, Carocci, Roma, 2018, pp. 169–202.
[11] B. Mortara Garavelli, Prontuario di punteggiatura, Laterza, Bari, 2003.
[12] A. Manzoni, F. Ghisalberti, A. Chiari, L’ultima revisione dei Promessi Sposi, in: Tutte le opere di Alessandro Manzoni. I Promessi Sposi, volume II, Mondadori, Milano, 1954, pp. 789–989.
[13] M. Straka, UDPipe 2.0 prototype at CoNLL 2018 UD shared task, in: D. Zeman, J. Hajič (Eds.), Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, Association for Computational Linguistics, Brussels, Belgium, 2018, pp. 197–207. URL: https://aclanthology.org/K18-2020. doi:10.18653/v1/K18-2020.
[14] M.-C. De Marneffe, C. D. Manning, J. Nivre, D. Zeman, Universal Dependencies, Computational linguistics 47 (2021) 255–308.
[15] T. Kiss, J. Strunk, Unsupervised multilingual sentence boundary detection, Computational Linguistics 32 (2006) 485–525. URL: https://aclanthology.org/J06-4003. doi:10.1162/coli.2006.32.4.485.
[16] Y. Liu, A. Stolcke, E. Shriberg, M. Harper, Using conditional random fields for sentence boundary detection in speech, in: Proceedings of the 43rd annual meeting of the Association for Computational Linguistics (ACL’05), 2005, pp. 451–458.
[17] R. Sheik, T. Gokul, S. Nirmala, Efficient deep learning-based sentence boundary detection in legal text, in: Proceedings of the Natural Legal Language Processing Workshop 2022, 2022, pp. 208–217.
[18] D. Rudrapal, A. Jamatia, K. Chakma, A. Das, B. Gambäck, Sentence boundary detection for social media text, in: Proceedings of the 12th International Conference on Natural Language Processing, 2015, pp. 254–260.
[19] A. A. Azzi, H. Bouamor, S. Ferradans, The FinSBD-2019 shared task: Sentence boundary detection in PDF noisy text in the financial domain, in: C.-C. Chen, H.-H. Huang, H. Takamura, H.-H. Chen (Eds.), Proceedings of the First Workshop on Financial Technology and Natural Language Processing, Macao, China, 2019, pp. 74–80. URL: https://aclanthology.org/W19-5512.
[20] T. Brugger, M. Stürmer, J. Niklaus, MultiLegalSBD: a multilingual legal sentence boundary detection dataset, in: Proceedings of the Nineteenth International Conference on Artificial Intelligence and Law, 2023, pp. 42–51.
[21] J. Read, R. Dridan, S. Oepen, L. J. Solberg, Sentence boundary detection: A long solved problem?, in: M. Kay, C. Boitet (Eds.), Proceedings of COLING 2012: Posters, The COLING 2012 Organizing Committee, Mumbai, India, 2012, pp. 985–994. URL: https://aclanthology.org/C12-2096.
[22] A. Ferrari, L. Lala, F. Longo, F. Pecorari, B. Rosi, R. Stojmenova, La punteggiatura italiana contemporanea. Un’analisi comunicativo-testuale, Carocci, Roma, 2018.
[23] C. D. Manning, M. Surdeanu, J. Bauer, J. R. Finkel, S. Bethard, D. McClosky, The Stanford CoreNLP natural language processing toolkit, in: Proceedings of 52nd annual meeting of the association for computational linguistics: system demonstrations, 2014, pp. 55–60.
[24] P. Koehn, Europarl: A parallel corpus for statistical machine translation, in: Proceedings of Machine Translation Summit X: Papers, Phuket, Thailand, 2005, pp. 79–86. URL: https://aclanthology.org/2005.mtsummit-papers.11.
[25] R. Delmonte, A. Bristot, S. Tonelli, VIT-Venice Italian Treebank: Syntactic and quantitative features., in: Sixth International Workshop on Treebanks and Linguistic Theories, volume 1, Northern European Association for Language Technol, 2007, pp. 43–54.
[26] R. Wicks, M. Post, A unified approach to sentence segmentation of punctuated text in many languages, in: C. Zong, F. Xia, W. Li, R. Navigli (Eds.), Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Association for Computational Linguistics, Online, 2021, pp. 3995–4007. URL: https://aclanthology.org/2021.acl-long.309. doi:10.18653/v1/2021.acl-long.309.
[27] B. Minixhofer, J. Pfeiffer, I. Vulić, Where’s the point? self-supervised multilingual punctuation-agnostic sentence segmentation, in: A. Rogers, J. Boyd-Graber, N. Okazaki (Eds.), Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Toronto, Canada, 2023, pp. 7215–7235. URL: https://aclanthology.org/2023.acl-long.398. doi:10.18653/v1/2023.acl-long.398.
[28] A. Palmero Aprosio, G. Moretti, Tint 2.0: an all-inclusive suite for NLP in Italian, in: Proceedings of the Fifth Italian Conference on Computational Linguistics (CLiC-it 2018), Accademia University Press, 2018, pp. 311–317.
[29] A. Manzoni, B. Colli, I Promessi Sposi. Edizione genetica della Quarantana, Casa del Manzoni, Milano, 2024.
[30] G. Verga, F. Cecco, I Malavoglia, Fondazione Verga-Interlinea, Catania-Novara, 2014.
[31] C. Collodi, O. Castellani Pollidori, Le avventure di Pinocchio, Fondazione nazionale Carlo Collodi, Pescia, 1983.
[32] E. De Amicis, L. Tamburini, Cuore. Libro per ragazzi, Einaudi, Torino, 2018 (1° ed. 1972).
[33] G. B. Bronzini, Proverbi, discorso e gesto proverbiale nei «Malavoglia», in: I Malavoglia. Atti del Congresso Internazionale di Studi (26-28 novembre 1981), Biblioteca della Fondazione Verga, Catania, 1982, pp. 637–684.
[34] E. Tonani, Il ’bianco di dialogato’ e il trattamento tipografico del discorso diretto, in: E. Tonani (Ed.), Il romanzo in bianco e nero. Ricerche sull’uso degli spazi bianchi e dell’interpunzione nella narrativa italiana dall’Ottocento a oggi, Franco Cesati, Firenze, 2010, pp. 103–136.
[35] R. Pellerey, Pinocchio tra dialogo e scrittura, Belfagor 60 (2005) 267–284. URL: https://www.jstor.org/stable/26150287.
[36] O. Castellani Pollidori, Introduzione, in: C. Collodi, O. Castellani Pollidori (Eds.), Le avventure di Pinocchio, Fondazione nazionale Carlo Collodi, Pescia, 1983, pp. XIII–LXXXIV.
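
As a supplementary illustration of the F1 scores reported in Section 5, the following minimal Python sketch shows one way a boundary-level F1 could be computed from a gold and a predicted segmentation of the same text. It is an assumption, not the paper's evaluation script: the paper does not describe how F1 was calculated, and the function names `boundary_offsets` and `boundary_f1` are hypothetical. Boundaries are identified by counting non-whitespace characters, so that differences in spacing between tools do not affect the comparison.

```python
from typing import List, Set


def boundary_offsets(sentences: List[str]) -> Set[int]:
    """Return, for each sentence, the cumulative count of non-whitespace
    characters at its end. Two segmentations of the same text share an
    offset exactly when they place a boundary at the same position."""
    offsets: Set[int] = set()
    count = 0
    for sentence in sentences:
        count += sum(1 for ch in sentence if not ch.isspace())
        offsets.add(count)
    return offsets


def boundary_f1(gold: List[str], predicted: List[str]) -> float:
    """Harmonic mean of boundary precision and recall (hypothetical helper)."""
    gold_b = boundary_offsets(gold)
    pred_b = boundary_offsets(predicted)
    true_pos = len(gold_b & pred_b)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(pred_b)
    recall = true_pos / len(gold_b)
    return 2 * precision * recall / (precision + recall)


# Gold has two sentences; the prediction merges them into one
# (a simplified version of the merge shown in Table 3, example 1).
gold = ["«Al sagrestano gli crede?»", "«Perché?»"]
predicted = ["«Al sagrestano gli crede?» «Perché?»"]
score = boundary_f1(gold, predicted)
```

In this toy case the prediction recovers one of the two gold boundaries with no spurious ones, so precision is 1.0 and recall is 0.5. Note that this sketch counts the final boundary of the text, which every system finds trivially; an evaluation script may choose to exclude it.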
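
The sentence definition adopted in footnote 1 and the boundary cues described in Section 4 can also be sketched as a small rule-based splitter. The code below is only an illustrative approximation under stated assumptions: it is not one of the eight evaluated tools, the name `split_sentences` is hypothetical, and it ignores abbreviations and other exceptions that real rule-based systems handle. It treats a run of strong punctuation marks (full stop, question mark, exclamation mark, suspension points), optionally followed by a closing guillemet or long dash, as a boundary only when the next character is a capital letter, an opening guillemet, or a long dash; colons, semicolons and commas never split.

```python
import re
from typing import List

# Candidate boundary: strong punctuation, an optional closing guillemet or
# long dash, then whitespace. Colons, semicolons and commas are not candidates.
_CANDIDATE = re.compile(r"[.?!…]+[»–]?\s+")


def split_sentences(text: str) -> List[str]:
    """Split at a candidate only if the following character is uppercase
    or opens a new line of direct speech (hypothetical helper)."""
    sentences: List[str] = []
    start = 0
    for match in _CANDIDATE.finditer(text):
        nxt = text[match.end():match.end() + 1]
        if nxt and (nxt.isupper() or nxt in "«–"):
            sentences.append(text[start:match.end()].strip())
            start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


# Examples taken from Table 3 and Section 3.
ex1 = split_sentences("«Al sagrestano gli crede?» «Perché?»")
ex2 = split_sentences("«Uh! ha voglia di scherzare, lei,» disse questa.")
ex3 = split_sentences("– È lei, di certo!– Era proprio lei, con la buona vedova.")
ex4 = split_sentences("«È il mio caso,» disse Renzo.")
```

On these inputs the sketch reproduces the gold segmentation: the closing guillemet stays with the sentence it closes, the exclamation mark followed by a lowercase letter does not split, and the diegetic frame after `,»` remains attached to the quotation. Cases such as unclosed dashes in “I Malavoglia” or the `!...` clusters in “Pinocchio” would need further rules.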