<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Is Sentence Splitting a Solved Task? Experiments to the Intersection Between NLP and Italian Linguistics</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Arianna</forename><surname>Redaelli</surname></persName>
							<email>arianna.redaelli@unipr.it</email>
							<affiliation key="aff0">
								<orgName type="institution">Università di Parma</orgName>
								<address>
									<addrLine>Via D&apos;Azeglio, 85</addrLine>
									<postCode>43125</postCode>
									<settlement>Parma</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Rachele</forename><surname>Sprugnoli</surname></persName>
							<email>rachele.sprugnoli@unipr.it</email>
							<affiliation key="aff0">
								<orgName type="institution">Università di Parma</orgName>
								<address>
									<addrLine>Via D&apos;Azeglio, 85</addrLine>
									<postCode>43125</postCode>
									<settlement>Parma</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<affiliation key="aff1">
								<orgName type="department">Tenth Italian Conference on Computational Linguistics</orgName>
								<address>
									<addrLine>Dec 04 -06</addrLine>
									<postCode>2024</postCode>
									<settlement>Pisa</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Is Sentence Splitting a Solved Task? Experiments to the Intersection Between NLP and Italian Linguistics</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">E371222F538323E8C0DE1940D361BBE1</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T17:38+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>sentence splitting</term>
					<term>text segmentation</term>
					<term>literary texts</term>
					<term>Italian</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Sentence splitting, that is the segmentation of the raw input text into sentences, is a fundamental step in text processing. Although it is considered a solved task for texts such as news articles and Wikipedia pages, the performance of systems can vary greatly depending on the text genre. This paper presents the evaluation of the performance of eight sentence splitting tools adopting different approaches (rule-based, supervised, semi-supervised, and unsupervised learning) on Italian 19th-century novels, a genre that has not received sufficient attention so far but which can be an interesting common ground between Natural Language Processing and Digital Humanities.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>Sentence splitting is the process of segmenting a text into sentences<note place="foot" n="1">By "sentence" we mean a coherent set of words constructed according to the general rules of the language, conveying a complete thought that makes sense on its own <ref type="bibr" target="#b0">[1]</ref>. A sentence ends with a strong punctuation mark (e.g., full stop, question mark, or exclamation point) and is typically followed by a capital letter. The definition of sentence adopted here, which like any definition is inherently problematic, is motivated by the specific requirements of the present work, as will be seen below.</note> by detecting their boundaries, which, at least for Western languages, including Italian, usually correspond to certain punctuation marks <ref type="bibr" target="#b1">[2]</ref>. This means that sentence splitting, for many languages, is a matter of punctuation disambiguation, that is, recognizing whether a punctuation mark signals a sentence boundary. The importance of sentence splitting is often underestimated because it is considered an easy task, but its quality has a strong impact on the quality of subsequent text processing, because errors can propagate, reducing the performance of downstream tasks such as Syntactic Analysis <ref type="bibr" target="#b2">[3]</ref>, Machine Translation <ref type="bibr" target="#b3">[4]</ref> and Automatic Summarization <ref type="bibr" target="#b4">[5]</ref>.</p><p>The most popular pipeline models, such as those of Stanza <ref type="bibr" target="#b5">[6]</ref> and spaCy<ref type="foot">2</ref>, have mostly been trained and evaluated on fairly formal texts, such as news articles and Wikipedia pages, so the publicly reported performances tend to be high, i.e. above 0.90 in terms of F1. However, the text genre has a significant impact on the results. 
For example, in the CoNLL 2018 shared task "Multilingual Parsing from Raw Text to Universal Dependencies", the best system on the Italian ISDT treebank <ref type="bibr" target="#b6">[7]</ref> achieved an F1 of 0.99, while on the PoSTWITA treebank, made of tweets <ref type="bibr" target="#b7">[8]</ref>, the highest result was 0.66. Given these variations, considering less formal text genres could provide valuable insights into the challenges of sentence splitting. Among these genres are literary texts, which present unique and peculiar stylistic and creative features that can break traditional grammatical norms, including punctuation ones <ref type="bibr" target="#b8">[9]</ref>. These features depend on both authorial choices and the cultural context of the time. As a matter of fact, punctuation can vary significantly depending on the historical period; literary texts may follow prevailing trends or oppose them, giving rise to new trends. This phenomenon is particularly evident in the 19th century, when the Italian usus punctandi began shifting from a primarily syntactic usage, prescribed by grammar books, to a communicative-textual usage of punctuation marks <ref type="bibr" target="#b9">[10]</ref>. Since this shift was probably influenced by the reflections and the practical uses of prominent authors such as Alessandro Manzoni <ref type="bibr" target="#b10">[11]</ref>, our study focuses on his historical novel, "I Promessi Sposi". The author paid meticulous attention to the punctuation of the text, revising it up to the final print proofs, and made specific and personal choices in collaboration with the publisher, alongside more classical ones <ref type="bibr" target="#b11">[12]</ref>. Although not always consistent, Manzoni's decisions make the novel particularly complex and interesting from a punctuation perspective. 
Furthermore, "I Promessi Sposi" has been a fundamental reference for the development of a common written Italian language: starting from this assumption, many of the author's punctuation choices have been adopted by later grammars for rule-making, though only some of them have become part of the standard. Given that punctuation was still undergoing standardization at the time, and that its use can depend not only on the conventions of the period but also on the writer's personal style, the type of content being addressed (and how it is presented), and even the influence of typography during the printing process, we also decided to broaden our study to include sections from other novels contemporary to Manzoni's (1840-42). Specifically, we analyzed "I Malavoglia" (1881) by Giovanni Verga, "Le avventure di Pinocchio. Storia di un burattino" (1883) by Carlo Collodi, and "Cuore" (1886) by Edmondo de Amicis.</p><p>In this paper, our main contributions are as follows: (i) we provide an estimate of the performance of eight sentence splitting tools adopting different approaches on a specific and challenging text genre, namely historical literary fiction texts, which has not received enough attention so far; (ii) we compare the results considering the point of view of humanities scholars (in particular Italian linguistics) as the main stakeholders in the considered domain, in order to establish a flourishing cross-fertilization between NLP and Digital Humanities; (iii) we release manually split data for four 19th-century Italian novels and a shared notebook for running many of the tested systems. <ref type="foot" target="#foot_0">3</ref></p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Related Work</head><p>Sentence splitting systems can be categorized into three macro-classes based on the approach used to develop them. There are rule-based systems, such as Sentence Splitter<ref type="foot" target="#foot_1">4</ref> and the Sentencizer module of spaCy, that use heuristics specific to the various languages and lists of exceptions and abbreviations. Then, there are supervised systems that need datasets in which sentences are already correctly segmented to be trained. For example, UDPipe <ref type="bibr" target="#b12">[13]</ref> and Stanza are trained on Universal Dependencies (UD) treebanks <ref type="bibr" target="#b13">[14]</ref>. Finally, unsupervised systems are trained on datasets of non-segmented texts taking advantage of features such as the length of words and collocational information. An example is given by Punkt, available as a module within the NLTK (Natural Language Toolkit) library <ref type="bibr" target="#b14">[15]</ref>. In our work, we test these various approaches on a benchmark dataset of historical literary fiction texts by evaluating the performance of eight different systems.</p><p>There are several studies that analyze the impact of text genre on sentence splitting, but literary texts are rarely considered. For example, Liu et al. <ref type="bibr" target="#b15">[16]</ref> work on speech transcriptions, Sheik et al. <ref type="bibr" target="#b16">[17]</ref> on legal texts, and Rudrapal et al. <ref type="bibr" target="#b17">[18]</ref> on social media posts. Moreover, a shared task on sentence boundary detection in the financial domain (FinSBD) was organized in 2019, 2020 and 2021 <ref type="bibr" target="#b18">[19]</ref>.</p><p>Most of the available studies concern the processing of English texts while Italian is usually not included in the evaluation. 
An interesting exception is given by a work on multilingual legal texts that contains a detailed evaluation of the results on Italian documents <ref type="bibr" target="#b19">[20]</ref>.</p><p>Our work draws inspiration from the assessment on English texts provided by Read et al. <ref type="bibr" target="#b20">[21]</ref> which includes, among others, the Sherlock Holmes stories, but moving to the Italian context. Furthermore, we focus on the literary context showing how 19th-century novels are a challenge for current sentence splitting systems.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Tools</head><p>Sentence splitting is a fundamental step in text processing, for which there are many tools available, also for Italian. For our evaluation we have selected eight tools developed with different approaches. Some tools are modules integrated in larger pipelines, others are systems specifically created to perform only sentence splitting. It is important to note that the selected tools do not split in the presence of a colon or semicolon. Indeed, although recent studies in the punctuation field identify colons and semicolons as punctuation marks capable of indicating the boundary of a sentence <ref type="bibr" target="#b21">[22]</ref>, as anticipated in footnote 1, in this work we have decided not to consider them as separating marks because of the various forms literary texts can take. To clarify the issue, we can consider the example of direct speech. In "I Promessi Sposi", direct speech can be introduced by a verbum dicendi and a colon, continuing without any interruption. In such cases, splitting at the colon would be relatively easy. However, direct speech can also be embedded within a sentence that continues after the quotation closes, creating a non-autonomous text portion that, during sentence splitting, should be manually reconnected to the one preceding the quotation itself (e.g., Lucia sospirò, e ripeté: «coraggio,» con una voce che smentiva la parola. EN: Lucia sighed, and repeated, «courage,» in a voice that belied the word.). An equally troublesome problem arises when the diegetic frame follows the quotation instead of preceding it. When this happens, the colon is absent, and other punctuation marks like commas are found before the closing quotation marks or dash (e.g., «È il mio caso,» disse Renzo. EN: «That's my case,» said Renzo.). 
The system would not split the sentences at these punctuation marks, yet the diegetic frame following the direct speech has the same value and autonomy as the one preceding it. Consequently, considering colons and semicolons as sentence boundaries would make the segmentation much more complex and often inaccurate.</p><p>The selected tools are the following:</p><p>• CoreNLP<ref type="foot" target="#foot_2">5</ref> : an NLP pipeline written in Java and developed by Stanford University <ref type="bibr" target="#b22">[23]</ref>.  <ref type="bibr" target="#b25">[26]</ref>.</p><p>• Punkt: an unsupervised system which uses collocational information to identify abbreviations, initials, and ordinal numbers. All punctuation not included in these elements is considered an end-of-sentence marker.</p><p>• WtP<ref type="foot" target="#foot_7">10</ref> : an unsupervised multilingual sentence segmentation system based on a self-supervised learning approach tested on 85 languages, including Italian. It does not rely on punctuation or sentence-segmented training data, and is thus a punctuation-agnostic system <ref type="bibr" target="#b26">[27]</ref>. Among the various available models, we adopted wtp-canine-s-12l which, according to the official documentation of the tool, has the best results on languages other than English.</p><p>For the evaluation, the tools were used as they are, with their default configurations and without any customization. For this reason, given the choices motivated above, we did not consider other systems, such as Tint <ref type="bibr" target="#b27">[28]</ref>, which by default splits at colons and semicolons.</p></div>
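<div xmlns="http://www.tei-c.org/ns/1.0"><p>The segmentation criterion motivated in this section can be illustrated with a minimal sketch. It is not one of the evaluated tools: the function name split_sentences, the regular expression, and the set of closing marks are assumptions introduced here for illustration only. The sketch treats a full stop, question mark, or exclamation point (optionally followed by a closing guillemet or dash) as a boundary only when the next character is a capital letter or an opening guillemet, and it never splits at colons or semicolons.</p>
<p><code lang="python"><![CDATA[
```python
import re

# Hypothetical sketch, not one of the evaluated systems.
# A candidate is a run of strong marks (. ? !), optionally followed by a
# closing guillemet or dash, then whitespace. Colons and semicolons are
# deliberately absent from the pattern.
CANDIDATE = re.compile(r"([.?!]+[»-]?)(\s+)")


def split_sentences(text):
    """Split text at strong punctuation followed by a capital letter or «."""
    sentences = []
    start = 0
    for match in CANDIDATE.finditer(text):
        following = text[match.end():match.end() + 1]
        # A lowercase continuation means the mark is only expressive.
        if following and (following.isupper() or following == "«"):
            sentences.append(text[start:match.end(1)].strip())
            start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


example = (
    "Lucia sospirò, e ripeté: «coraggio,» con una voce che smentiva la parola. "
    "«È il mio caso,» disse Renzo."
)
for sentence in split_sentences(example):
    print(sentence)
```
]]></code></p>
<p>On the two examples quoted in this section, the sketch yields two sentences and leaves the colon after "ripeté" untouched. It does not handle abbreviations, suspension points printed with spaces, or unclosed dashes, which is where the tools compared in this paper differ most.</p></div>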
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Dataset</head><p>The data used to evaluate the aforementioned tools are taken from "I Promessi Sposi" in its final version published in 1840-1842 <ref type="foot" target="#foot_8">11</ref> . A total of 3,095 sentences, corresponding to 12 chapters of the novel, were manually split. This dataset was divided into training, development and test sets according to the proportions 80/10/10, following the UD rules, under which the proportion is calculated using syntactic words as units. <ref type="foot" target="#foot_9">12</ref> To obtain syntactic words and calculate this splitting, sentences were segmented and tokenized by hand; this gold standard was then processed with the combined Stanza model. <ref type="foot" target="#foot_10">13</ref> Following this division, the test set is made of 324 sentences.</p><p>Table <ref type="table" target="#tab_1">1</ref> shows the sentence-ending punctuation marks in the test set. Both the total number of occurrences (TOTAL) and the number of times a sign is an end-of-sentence marker (EOS) are reported. In addition to the full stop, sentence boundaries can be indicated by expressive punctuation marks (!, ?) when followed by a capital letter. If followed by a lowercase letter, instead, these marks only have an expressive role, modifying the sentence's internal intonation without determining its end. Low quotation marks («») and long dashes (-), used for direct speech and thoughts respectively, typically determine a sentence boundary when they appear with another demarcative punctuation mark (e.g., a full stop). In Manzoni's novel, if a closing quotation mark (guillemets or long dashes) appears with another punctuation mark, the latter is usually placed before the former, which formally closes the sentence. Lastly, in the novel, suspension points (...) can indicate a sentence boundary when they suggest a suspensive allusion or when they mark the interruption of a character's line due to linguistic or extra-linguistic contingencies. In such cases, suspension points' demarcative function is shown either by the following capital letter or by an opening quotation mark which indicates the beginning of a different character's line.</p><p>Analyzing the outputs of the various systems, it is possible to notice some recurring errors (a few examples are reported in Table <ref type="table">3</ref>):</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Results of the Evaluation</head><p>1. Misinterpretation of guillemets («,»). The closing sign of the low quotation marks is not recognized as a sentence boundary, so in the automatic segmentation it can appear at the beginning or in the middle of a sentence.</p><p>2. In supervised systems semicolons and colons are sometimes considered as sentence boundary signals. Indeed, in the VIT treebank and in those used to train the combined Stanza model, sentences are segmented inconsistently: sometimes semicolons and colons are strong punctuation, and sometimes not.</p><p>3. Suspension points are always considered strong punctuation marks and the sentence is split after them.</p><p>4. A sentence is often split after an expressive punctuation mark (?, !) even if it is followed by a lowercase letter.</p><p>5. The long dash is not recognized as a sentence-ending marker; consequently, either the sentence continues after the dash or the dash appears at the beginning of the following sentence.</p></div>
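<div xmlns="http://www.tei-c.org/ns/1.0"><p>Since the evaluation is reported in terms of F1, the following minimal sketch shows one way such a score can be computed for sentence boundaries. It assumes that gold and predicted boundaries are represented as sets of character offsets at which a sentence ends; this representation and the function name boundary_f1 are assumptions made for illustration, and the scorer actually used for the evaluation may match sentences differently.</p>
<p><code lang="python"><![CDATA[
```python
def boundary_f1(gold, predicted):
    """Return (precision, recall, F1) over sentence-end character offsets.

    Assumption for this sketch: a predicted boundary counts as correct only
    if the same offset is present in the gold segmentation.
    """
    gold_set = set(gold)
    predicted_set = set(predicted)
    true_positives = len(gold_set & predicted_set)
    precision = true_positives / len(predicted_set) if predicted_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical offsets: three gold boundaries, four predicted ones,
# two of which coincide with the gold standard.
gold_offsets = [10, 25, 40]
predicted_offsets = [10, 40, 55, 60]
precision, recall, f1 = boundary_f1(gold_offsets, predicted_offsets)
print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")
```
]]></code></p>
<p>In this toy case two of the four predicted boundaries are correct and one gold boundary is missed, so precision is 0.5, recall is 2/3 and F1 is 4/7. Over-splitting errors such as those in items 3 and 4 lower precision, while missed closing guillemets or dashes (items 1 and 5) lower recall.</p></div>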
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">Training a New Stanza Model</head><p>With the rest of the manually split data, namely 2,447 sentences for the training set and 324 for the development set, a new Stanza model specific to Manzoni's text was trained. Different amounts of sentences were used for training in order to control the effect of the dataset size on the performance. The results obtained with 1,500 steps are the following:</p><p>• 300 sentences: 0.97 F1</p><p>• 1,000 sentences: 0.98 F1</p><p>• 2,447 sentences: 0.99 F1</p><p>With just 300 sentences there is already a clear improvement over the default model, obtaining an even higher result than the one obtained with Sentence Splitter, the system that had proven to be the best on our test set.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.">What About Other Novels?</head><p>Table <ref type="table" target="#tab_4">4</ref> displays the performance of the same systems tested on "I Promessi Sposi" on the first approximately 90 sentences of three other important 19th-century novels: <ref type="foot" target="#foot_11">14</ref> "I Malavoglia" (1881) by Giovanni Verga <ref type="bibr" target="#b29">[30]</ref>, "Le avventure di Pinocchio. Storia di un burattino" (1883) by Carlo Collodi <ref type="bibr" target="#b30">[31]</ref>, "Cuore" (1886) by Edmondo de Amicis <ref type="bibr" target="#b31">[32]</ref>.<ref type="foot" target="#foot_12">15</ref> </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 3</head><p>Examples of errors in two of the tested systems compared with the manually split sentences.</p><table><row><cell>TEST GOLD</cell><cell>UDPipe 2 - VIT model</cell><cell>Ersatz</cell></row><row><cell>1) «Al sagrestano gli crede?» 2) «Perché?»</cell><cell>1) » «Al sagrestano gli crede?» «Perché?»</cell><cell>1) » «Al sagrestano gli crede? 2) » «Perché?</cell></row><row><cell>1) -È lei, di certo!- 2) Era proprio lei, con la buona vedova.</cell><cell>1) -È lei, di certo!-Era proprio lei, con la buona vedova.</cell><cell>1) -È lei, di certo! 2) -Era proprio lei, con la buona vedova.</cell></row><row><cell>1) Anche Agnese, veda; anche Agnese. . . » 2) «Uh! ha voglia di scherzare, lei,» disse questa.</cell><cell>1) Anche Agnese, veda; anche Agnese. . . » «Uh! ha voglia di scherzare, lei,» disse questa.</cell><cell>1) Anche Agnese, veda; anche Agnese. . . » «Uh! 2) ha voglia di scherzare, lei,» disse questa.</cell></row></table><p>The results obtained are once again lower than those reported for contemporary texts, but the model retrained on "I Promessi Sposi" shows improved performance for all novels, especially when applied to "I Malavoglia" and to "Le avventure di Pinocchio" (+19 points with respect to the default Stanza combined model in both cases); the improvement is more limited for "Cuore" (+8 points).</p><p>The rule-based approach is promising but with different systems (spaCy for "Cuore" and ssplit for "I Malavoglia"). Instead, the VIT model of UDPipe, and therefore a supervised approach, is the best on "Le avventure di Pinocchio". Some tools obtain extremely different results depending on the text they process. spaCy and Sentence Splitter record a very low result on "Le avventure di Pinocchio" (0.35 and 0.45 respectively) while WtP has an F1 of only 0.39 on "Cuore", half of what it achieved on "Le avventure di Pinocchio". This diversified situation is principally due to the fact that each novel presents unique characteristics, even in punctuation.</p><p>"I Malavoglia" is a choral novel in which the various styles of speech of the characters and the narrative voice are mixed together. 
Punctuation marks largely represent this mixture. Indeed, among the main peculiarities of the novel is the original and personal use of quotation marks. For example, guillemets («,») are frequently used to refer to popular sayings and proverbs as well as to short formulas <ref type="bibr" target="#b32">[33]</ref>, which sometimes intersperse the diegesis, whether introduced by colons or not, and sometimes isolate a complete enunciative section. The long dash (-), instead, has a number of different functions <ref type="bibr" target="#b33">[34]</ref>: one of these is to signal direct speech, but often marking only its beginning and not its end. This leads, on one hand, to a variety of ways of handling parenthetical elements and, on the other hand, to a blurred boundary between the characters' speech, the characters' speech mediated by the narrator, and the narrator's own discourse.</p><p>"Pinocchio", a novel written for a young audience, is characterized by a strongly dialogic style <ref type="bibr" target="#b34">[35]</ref>. For direct speech, including the simulated dialogue between the narrator and the reader, the long dash (-) is abundantly used, but as for "I Malavoglia", the opening dashes are not always accompanied by the closing ones. Additionally, Collodi frequently uses punctuation clusters, specifically the exclamation mark followed by suspension points (!...), at the end of sentences <ref type="bibr" target="#b35">[36]</ref>, a possibility mostly not contemplated by late 19th-century grammars.</p><p>Lastly, Edmondo de Amicis's novel "Cuore" tells the story of a child's school experience from his point of view, adopting a diary-like structure. In "Cuore", the linguistic form is simple and plain: the sentences are mainly short and often end with a standard strong punctuation mark, followed by a capital letter. 
Direct speech is clearly indicated by long dashes (-), but successive lines of dialogue are arranged consecutively on the page, and in such cases, the closing dash of the previous line also serves as the opening dash of the next line. Since the lines of dialogue are perfectly integrated into the narrative structure, they can end with various punctuation marks, from commas to semicolons to full stops. When the punctuation mark is not strong, after the preliminary conclusion of the line, the text continues with the narrator's discourse.</p><p>Beyond the specific differences listed schematically above, there are also some common typographical and punctuation features among the considered novels. For example, when a closing quotation mark appears with another punctuation mark, the latter in general occurs before the former, as found in "I Promessi Sposi".</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="8.">Conclusions</head><p>This paper presents an assessment of the performance of eight sentence splitting tools adopting different approaches on four 19th-century novels: "I Promessi Sposi" by Alessandro Manzoni, "I Malavoglia" by Giovanni Verga, "Le avventure di Pinocchio" by Carlo Collodi, and "Cuore" by Edmondo de Amicis. Although these texts belong to the same historical period, they show specific features depending on the form and content of the novel as well as the author's stylistic choices. Among these features is punctuation, which in the late 19th century had not yet reached a detectable stability and was rather experiencing a paradigmatic change.</p><p>Since sentence splitting for Western languages, including Italian, relies heavily on punctuation disambiguation, applying existing tools to the four novels considered has resulted in performances well below the standards reported for contemporary texts. These texts demonstrate that sentence splitting is not a completely solved task.</p><p>On the other hand, applying the model retrained on "I Promessi Sposi" to the other three novels showed significant improvements for "Le avventure di Pinocchio" and "I Malavoglia", and a moderate improvement for "Cuore". This result suggests that shared historical context and belonging to the same textual genre may offer sufficient similarities to improve the model's performance. However, the example of "Cuore" is evidence of how this is sometimes not enough: some specific features in form, punctuation and style continue to affect sentence splitting, demonstrating that although retraining may mitigate some problems, it does not completely overcome the inherent variability of these texts.</p><p>Philologists have increasingly focused on preserving the original punctuation as a part of the author's creation of the text, providing valuable and reliable supports of study for scholars of linguistics and the history of the Italian language. 
Their combined knowledge is precious for achieving accurate sentence splitting in these texts. Thus, sentence splitting can be an interesting common ground between different disciplines, potentially leading to the development of tools for the automatic analysis of historical literary texts. This field remains under-explored in the Italian context, offering significant opportunities for further study and cross-disciplinary collaboration.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 1</head><label>1</label><figDesc>End-of-sentence markers in the test set.</figDesc><table><row><cell>MARK</cell><cell># TOTAL</cell><cell># EOS</cell></row><row><cell>.</cell><cell>277</cell><cell>237</cell></row><row><cell>»</cell><cell>90</cell><cell>53</cell></row><row><cell>?</cell><cell>47</cell><cell>22</cell></row><row><cell>!</cell><cell>31</cell><cell>6</cell></row><row><cell>. . .</cell><cell>23</cell><cell>3</cell></row><row><cell>-</cell><cell>10</cell><cell>3</cell></row></table><note>which formally closes the sentence. Lastly, in the novel, suspension points (...) can indicate a sentence boundary when they suggest a suspensive allusion or when they mark the interruption of a character's line due to linguistic or extra-linguistic contingencies. In such cases, suspension points' demarcative function is shown either by the following capital letter or by an opening quotation mark which indicates the beginning of a different character's line.</note></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 2</head><label>2</label><figDesc></figDesc><table /><note>Table 2 reports the results of our evaluation in terms of F1. The best performance (0.94) is registered with Sentence Splitter, a rule-based system. None of the other tools exceeds 0.70, so their performance is significantly lower than that reported on contemporary Italian texts. For example, the official result of UDPipe 2 on the VIT treebank with the 2.12 model starting from raw text is 0.95, that is, almost 30 points more than what is obtained on our test set. The lowest result (0.51) is obtained by the unsupervised WtP system. Although the rule-based approach seems to be the most promising, only Sentence Splitter has an excellent result even without any adaptation of the existing rules.</note></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_3"><head>Table 2</head><label>2</label><figDesc>Results (in terms of F1) of eight systems developed with different approaches: rule-based (RB), supervised (S), semi-supervised (SS) and unsupervised learning (U).</figDesc><table><row><cell>TYPE</cell><cell>SYSTEM</cell><cell>F1</cell></row><row><cell>RB</cell><cell>spaCy sentencizer</cell><cell>0.61</cell></row><row><cell></cell><cell>CoreNLP 4.5.7 ssplit</cell><cell>0.66</cell></row><row><cell></cell><cell>SentenceSplitter</cell><cell>0.94</cell></row><row><cell>S</cell><cell>UDPipe 2 VIT model</cell><cell>0.66</cell></row><row><cell></cell><cell>Stanza combined</cell><cell>0.69</cell></row><row><cell>SS</cell><cell>Ersatz</cell><cell>0.60</cell></row><row><cell>U</cell><cell>Punkt</cell><cell>0.68</cell></row><row><cell></cell><cell>WtP wtp-canine-s-12l</cell><cell>0.51</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_4"><head>Table 4</head><label>4</label><figDesc>Results on about 90 sentences taken from other 19th-century novels. Stanza retr. refers to the model retrained on Manzoni's novel, as described in Section 6.</figDesc><table><row><cell></cell><cell>Malavoglia</cell><cell>Pinocchio</cell><cell>Cuore</cell></row><row><cell>spaCy</cell><cell>0.73</cell><cell>0.35</cell><cell>0.84</cell></row><row><cell>CoreNLP ssplit</cell><cell>0.76</cell><cell>0.72</cell><cell>0.62</cell></row><row><cell>SentenceSplit.</cell><cell>0.77</cell><cell>0.45</cell><cell>0.68</cell></row><row><cell>UDPipe</cell><cell>0.75</cell><cell>0.79</cell><cell>0.67</cell></row><row><cell>Stanza</cell><cell>0.71</cell><cell>0.70</cell><cell>0.61</cell></row><row><cell>Stanza retr.</cell><cell>0.90</cell><cell>0.89</cell><cell>0.69</cell></row><row><cell>Ersatz</cell><cell>0.72</cell><cell>0.75</cell><cell>0.66</cell></row><row><cell>Punkt</cell><cell>0.73</cell><cell>0.77</cell><cell>0.66</cell></row><row><cell>WtP</cell><cell>0.53</cell><cell>0.78</cell><cell>0.39</cell></row></table></figure>
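<div xmlns="http://www.tei-c.org/ns/1.0"><p>The gains attributed to retraining in Section 7 can be recomputed from the Stanza rows of Table 4. The short sketch below copies the F1 values from the table; the variable and function names are hypothetical and serve only to make the arithmetic explicit.</p>
<p><code lang="python"><![CDATA[
```python
# F1 values copied from the "Stanza" and "Stanza retr." rows of Table 4.
stanza_default = {"Malavoglia": 0.71, "Pinocchio": 0.70, "Cuore": 0.61}
stanza_retrained = {"Malavoglia": 0.90, "Pinocchio": 0.89, "Cuore": 0.69}


def gain_in_points(before, after):
    """Return the F1 difference per novel, expressed in whole points."""
    # Rounding absorbs floating-point noise in the subtraction.
    return {novel: round((after[novel] - before[novel]) * 100) for novel in before}


gains = gain_in_points(stanza_default, stanza_retrained)
for novel, points in gains.items():
    print(f"{novel}: +{points} points")
```
]]></code></p>
<p>The differences are 19 points for "I Malavoglia" and "Le avventure di Pinocchio" and 8 points for "Cuore", matching the figures given in the discussion of Table 4.</p></div>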
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_0">https://github.com/RacheleSprugnoli/Sentence_Splitting_Manzoni</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_1">https://github.com/mediacloud/sentence-splitter</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="5" xml:id="foot_2">https://stanfordnlp.github.io/CoreNLP/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="6" xml:id="foot_3">https://github.com/mediacloud/sentence-splitter</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="7" xml:id="foot_4">https://ufal.mff.cuni.cz/udpipe</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="8" xml:id="foot_5">https://stanfordnlp.github.io/stanza/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="9" xml:id="foot_6">https://github.com/rewicks/ersatz</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="10" xml:id="foot_7">https://github.com/segment-any-text/wtpsplit</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="11" xml:id="foot_8">The text, fully digitized and available online, was collated with the reference edition<ref type="bibr" target="#b28">[29]</ref> prior to analysis, to ensure maximum fidelity to the author's punctuation choices.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="12" xml:id="foot_9">https://universaldependencies.org/release_checklist.html#data-split</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="13" xml:id="foot_10">The output of this process was used to train a new Stanza model as reported in Section 6.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="14" xml:id="foot_11">The reference edition text was used for the analysis of these novels too.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="15" xml:id="foot_12">86 sentences are taken from "I Malavoglia", corresponding to the first chapter of the novel; 93 sentences, corresponding to the first two chapters, come from "Le avventure di Pinocchio"; 87 sentences are taken from "Cuore", corresponding to the first three chapters of the novel.</note>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Acknowledgments</head><p>This publication was produced by a researcher on a research contract co-funded by the European Union, PON Ricerca e Innovazione 2014-2020, pursuant to Art. 24, paragraph 3, letter a, of Law No. 240 of 30 December 2010, as subsequently amended, and to Ministerial Decree No. 1062 of 10 August 2021.</p></div>
			</div>


			<div type="funding">
<div xmlns="http://www.tei-c.org/ns/1.0"><p>This paper is the result of the collaboration between the two authors. For the specific purposes of the Italian academic attribution system: Rachele Sprugnoli is responsible for Sections 2, 3, 6; Arianna Redaelli is responsible for Sections 1, 4, 8. Section 7 was written collaboratively by the two authors.</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<monogr>
		<author>
			<persName><forename type="first">I</forename><surname>Bonomi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Masini</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Morgana</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Piotti</surname></persName>
		</author>
		<title level="m">Elementi di linguistica italiana</title>
				<imprint>
			<publisher>Carocci</publisher>
			<date type="published" when="2010">2010</date>
			<biblScope unit="volume">103</biblScope>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<monogr>
		<title level="m" type="main">Chapter 2: Tokenisation and sentence segmentation, Handbook of natural language processing</title>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">D</forename><surname>Palmer</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2007">2007</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Document parsing: Towards realistic syntactic analysis</title>
		<author>
			<persName><forename type="first">R</forename><surname>Dridan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Oepen</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of The 13th International Conference on Parsing Technologies</title>
				<meeting>The 13th International Conference on Parsing Technologies<address><addrLine>IWPT</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2013">2013</date>
			<biblScope unit="page" from="127" to="133" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Does sentence segmentation matter for machine translation?</title>
		<author>
			<persName><forename type="first">R</forename><surname>Wicks</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Post</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Seventh Conference on Machine Translation (WMT)</title>
				<meeting>the Seventh Conference on Machine Translation (WMT)</meeting>
		<imprint>
			<date type="published" when="2022">2022</date>
			<biblScope unit="page" from="843" to="854" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Impact of automatic sentence segmentation on meeting summarization</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Xie</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">2008 IEEE International Conference on Acoustics, Speech and Signal Processing</title>
				<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2008">2008</date>
			<biblScope unit="page" from="5009" to="5012" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Stanza: A Python natural language processing toolkit for many human languages</title>
		<author>
			<persName><forename type="first">P</forename><surname>Qi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Bolton</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">D</forename><surname>Manning</surname></persName>
		</author>
		<ptr target="https://nlp.stanford.edu/pubs/qi2020stanza.pdf" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations</title>
				<meeting>the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations</meeting>
		<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Converting Italian Treebanks: Towards an Italian Stanford Dependency Treebank</title>
		<author>
			<persName><forename type="first">C</forename><surname>Bosco</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Montemagni</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Simi</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 7th Linguistic Annotation Workshop and Interoperability with Discourse, The Association for Computational Linguistics</title>
				<meeting>the 7th Linguistic Annotation Workshop and Interoperability with Discourse, The Association for Computational Linguistics</meeting>
		<imprint>
			<date type="published" when="2013">2013</date>
			<biblScope unit="page" from="61" to="69" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">PoSTWITA-UD: an Italian Twitter treebank in Universal Dependencies</title>
		<author>
			<persName><forename type="first">M</forename><surname>Sanguinetti</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Bosco</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Lavelli</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Mazzei</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Antonelli</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Tamburini</surname></persName>
		</author>
		<ptr target="https://aclanthology.org/L18-1279" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), European Language Resources Association (ELRA)</title>
				<editor>
			<persName><forename type="first">N</forename><surname>Calzolari</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">K</forename><surname>Choukri</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">C</forename><surname>Cieri</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">T</forename><surname>Declerck</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">S</forename><surname>Goggi</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">K</forename><surname>Hasida</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">H</forename><surname>Isahara</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">B</forename><surname>Maegaard</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><surname>Mariani</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">H</forename><surname>Mazo</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">A</forename><surname>Moreno</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><surname>Odijk</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">S</forename><surname>Piperidis</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">T</forename><surname>Tokunaga</surname></persName>
		</editor>
		<meeting>the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), European Language Resources Association (ELRA)<address><addrLine>Miyazaki, Japan</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Premessa. Tra punteggiatura e tipografia</title>
		<author>
			<persName><forename type="first">E</forename><surname>Tonani</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Il romanzo in bianco e nero. Ricerche sull&apos;uso degli spazi bianchi e dell&apos;interpunzione nella narrativa italiana dall&apos;Ottocento a oggi</title>
				<editor>
			<persName><forename type="first">E</forename><surname>Tonani</surname></persName>
		</editor>
		<meeting><address><addrLine>Firenze</addrLine></address></meeting>
		<imprint>
			<publisher>Franco Cesati</publisher>
			<date type="published" when="2010">2010</date>
			<biblScope unit="page" from="13" to="28" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Punteggiatura</title>
		<author>
			<persName><forename type="first">A</forename><surname>Ferrari</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Storia dell&apos;italiano scritto. Grammatiche, volume IV</title>
				<editor>
			<persName><forename type="first">G</forename><surname>Antonelli</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">M</forename><surname>Motolese</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">L</forename><surname>Tomasi</surname></persName>
		</editor>
		<meeting><address><addrLine>Roma</addrLine></address></meeting>
		<imprint>
			<publisher>Carocci</publisher>
			<date type="published" when="2018">2018</date>
			<biblScope unit="page" from="169" to="202" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<monogr>
		<author>
			<persName><forename type="first">B</forename><forename type="middle">Mortara</forename><surname>Garavelli</surname></persName>
		</author>
		<title level="m">Prontuario di punteggiatura</title>
				<meeting><address><addrLine>Bari</addrLine></address></meeting>
		<imprint>
			<publisher>Laterza</publisher>
			<date type="published" when="2003">2003</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<monogr>
		<author>
			<persName><forename type="first">A</forename><surname>Manzoni</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Ghisalberti</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Chiari</surname></persName>
		</author>
		<title level="m">L&apos;ultima revisione dei Promessi Sposi, in: Tutte le opere di Alessandro Manzoni</title>
				<meeting><address><addrLine>Milano</addrLine></address></meeting>
		<imprint>
			<publisher>Mondadori</publisher>
			<date type="published" when="1954">1954</date>
			<biblScope unit="volume">II</biblScope>
			<biblScope unit="page" from="789" to="989" />
		</imprint>
	</monogr>
	<note>I Promessi Sposi</note>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">UDPipe 2.0 prototype at CoNLL 2018 UD shared task</title>
		<author>
			<persName><forename type="first">M</forename><surname>Straka</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/K18-2020</idno>
		<ptr target="https://aclanthology.org/K18-2020" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, Association for Computational Linguistics</title>
				<editor>
			<persName><forename type="first">D</forename><surname>Zeman</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><surname>Hajič</surname></persName>
		</editor>
		<meeting>the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, Association for Computational Linguistics<address><addrLine>Brussels, Belgium</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2018">2018</date>
			<biblScope unit="page" from="197" to="207" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">Universal Dependencies</title>
		<author>
			<persName><forename type="first">M.-C</forename><surname>De Marneffe</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">D</forename><surname>Manning</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Nivre</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Zeman</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Computational Linguistics</title>
		<imprint>
			<biblScope unit="volume">47</biblScope>
			<biblScope unit="page" from="255" to="308" />
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">Unsupervised multilingual sentence boundary detection</title>
		<author>
			<persName><forename type="first">T</forename><surname>Kiss</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Strunk</surname></persName>
		</author>
		<idno type="DOI">10.1162/coli.2006.32.4.485</idno>
		<ptr target="https://aclanthology.org/J06-4003" />
	</analytic>
	<monogr>
		<title level="j">Computational Linguistics</title>
		<imprint>
			<biblScope unit="volume">32</biblScope>
			<biblScope unit="page" from="485" to="525" />
			<date type="published" when="2006">2006</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">Using conditional random fields for sentence boundary detection in speech</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Stolcke</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Shriberg</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Harper</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 43rd annual meeting of the Association for Computational Linguistics (ACL&apos;05)</title>
				<meeting>the 43rd annual meeting of the Association for Computational Linguistics (ACL&apos;05)</meeting>
		<imprint>
			<date type="published" when="2005">2005</date>
			<biblScope unit="page" from="451" to="458" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">Efficient deep learning-based sentence boundary detection in legal text</title>
		<author>
			<persName><forename type="first">R</forename><surname>Sheik</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Gokul</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Nirmala</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Natural Legal Language Processing Workshop 2022</title>
				<meeting>the Natural Legal Language Processing Workshop 2022</meeting>
		<imprint>
			<date type="published" when="2022">2022</date>
			<biblScope unit="page" from="208" to="217" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">Sentence boundary detection for social media text</title>
		<author>
			<persName><forename type="first">D</forename><surname>Rudrapal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Jamatia</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Chakma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Das</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Gambäck</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 12th International Conference on Natural Language Processing</title>
				<meeting>the 12th International Conference on Natural Language Processing</meeting>
		<imprint>
			<date type="published" when="2015">2015</date>
			<biblScope unit="page" from="254" to="260" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">The FinSBD-2019 shared task: Sentence boundary detection in PDF noisy text in the financial domain</title>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">A</forename><surname>Azzi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Bouamor</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Ferradans</surname></persName>
		</author>
		<ptr target="https://aclanthology.org/W19-5512" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the First Workshop on Financial Technology and Natural Language Processing</title>
				<editor>
			<persName><forename type="first">C.-C</forename><surname>Chen</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">H.-H</forename><surname>Huang</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">H</forename><surname>Takamura</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">H.-H</forename><surname>Chen</surname></persName>
		</editor>
		<meeting>the First Workshop on Financial Technology and Natural Language Processing<address><addrLine>Macao, China</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2019">2019</date>
			<biblScope unit="page" from="74" to="80" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<analytic>
		<title level="a" type="main">MultiLegalSBD: a multilingual legal sentence boundary detection dataset</title>
		<author>
			<persName><forename type="first">T</forename><surname>Brugger</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Stürmer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Niklaus</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Nineteenth International Conference on Artificial Intelligence and Law</title>
				<meeting>the Nineteenth International Conference on Artificial Intelligence and Law</meeting>
		<imprint>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="42" to="51" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<analytic>
		<title level="a" type="main">Sentence boundary detection: A long solved problem?</title>
		<author>
			<persName><forename type="first">J</forename><surname>Read</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Dridan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Oepen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><forename type="middle">J</forename><surname>Solberg</surname></persName>
		</author>
		<ptr target="https://aclanthology.org/C12-2096" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of COLING 2012: Posters, The COLING 2012 Organizing Committee</title>
				<editor>
			<persName><forename type="first">M</forename><surname>Kay</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">C</forename><surname>Boitet</surname></persName>
		</editor>
		<meeting>COLING 2012: Posters, The COLING 2012 Organizing Committee<address><addrLine>Mumbai, India</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2012">2012</date>
			<biblScope unit="page" from="985" to="994" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b21">
	<monogr>
		<author>
			<persName><forename type="first">A</forename><surname>Ferrari</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Lala</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Longo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Pecorari</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Rosi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Stojmenova</surname></persName>
		</author>
		<title level="m">La punteggiatura italiana contemporanea</title>
				<meeting><address><addrLine>Roma</addrLine></address></meeting>
		<imprint>
			<publisher>Carocci</publisher>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
	<note>Un&apos;analisi comunicativo-testuale</note>
</biblStruct>

<biblStruct xml:id="b22">
	<analytic>
		<title level="a" type="main">The Stanford CoreNLP natural language processing toolkit</title>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">D</forename><surname>Manning</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Surdeanu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Bauer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">R</forename><surname>Finkel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Bethard</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>McClosky</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations</title>
				<meeting>52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations</meeting>
		<imprint>
			<date type="published" when="2014">2014</date>
			<biblScope unit="page" from="55" to="60" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b23">
	<analytic>
		<title level="a" type="main">Europarl: A parallel corpus for statistical machine translation</title>
		<author>
			<persName><forename type="first">P</forename><surname>Koehn</surname></persName>
		</author>
		<ptr target="https://aclanthology.org/2005.mtsummit-papers.11" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of Machine Translation Summit X: Papers</title>
				<meeting>Machine Translation Summit X: Papers<address><addrLine>Phuket, Thailand</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2005">2005</date>
			<biblScope unit="page" from="79" to="86" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b24">
	<analytic>
		<title level="a" type="main">VIT-Venice Italian Treebank: Syntactic and quantitative features</title>
		<author>
			<persName><forename type="first">R</forename><surname>Delmonte</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Bristot</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Tonelli</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Sixth International Workshop on Treebanks and Linguistic Theories</title>
				<imprint>
			<publisher>Northern European Association for Language Technol</publisher>
			<date type="published" when="2007">2007</date>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="page" from="43" to="54" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b25">
	<analytic>
		<title level="a" type="main">A unified approach to sentence segmentation of punctuated text in many languages</title>
		<author>
			<persName><forename type="first">R</forename><surname>Wicks</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Post</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2021.acl-long.309</idno>
		<ptr target="https://aclanthology.org/2021.acl-long.309" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing</title>
				<editor>
			<persName><forename type="first">C</forename><surname>Zong</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">F</forename><surname>Xia</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">W</forename><surname>Li</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">R</forename><surname>Navigli</surname></persName>
		</editor>
		<meeting>the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing</meeting>
		<imprint>
			<date type="published" when="2021">2021</date>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="page" from="3995" to="4007" />
		</imprint>
	</monogr>
	<note>: Long Papers), Association for Computational Linguistics</note>
</biblStruct>

<biblStruct xml:id="b26">
	<analytic>
		<title level="a" type="main">Where&apos;s the point? Self-supervised multilingual punctuation-agnostic sentence segmentation</title>
		<author>
			<persName><forename type="first">B</forename><surname>Minixhofer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Pfeiffer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Vulić</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2023.acl-long.398</idno>
		<ptr target="https://aclanthology.org/2023.acl-long.398" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics</title>
				<editor>
			<persName><forename type="first">A</forename><surname>Rogers</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><surname>Boyd-Graber</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">N</forename><surname>Okazaki</surname></persName>
		</editor>
		<meeting>the 61st Annual Meeting of the Association for Computational Linguistics<address><addrLine>Toronto, Canada</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2023">2023</date>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="page" from="7215" to="7235" />
		</imprint>
	</monogr>
	<note>: Long Papers), Association for Computational Linguistics</note>
</biblStruct>

<biblStruct xml:id="b27">
	<analytic>
		<title level="a" type="main">Tint 2.0: an all-inclusive suite for NLP in Italian</title>
		<author>
			<persName><forename type="first">A</forename><surname>Palmero Aprosio</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Moretti</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Fifth Italian Conference on Computational Linguistics (CLiC-it 2018)</title>
				<meeting>the Fifth Italian Conference on Computational Linguistics (CLiC-it 2018)</meeting>
		<imprint>
			<publisher>Accademia University Press</publisher>
			<date type="published" when="2018">2018</date>
			<biblScope unit="page" from="311" to="317" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b28">
	<monogr>
		<author>
			<persName><forename type="first">A</forename><surname>Manzoni</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Colli</surname></persName>
		</author>
		<title level="m">I Promessi Sposi. Edizione genetica della Quarantana</title>
				<meeting><address><addrLine>Milano</addrLine></address></meeting>
		<imprint>
			<publisher>Casa del Manzoni</publisher>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b29">
	<monogr>
		<author>
			<persName><forename type="first">G</forename><surname>Verga</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Cecco</surname></persName>
		</author>
		<title level="m">I Malavoglia</title>
				<meeting><address><addrLine>Catania-Novara</addrLine></address></meeting>
		<imprint>
			<publisher>Fondazione Verga-Interlinea</publisher>
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b30">
	<monogr>
		<author>
			<persName><forename type="first">C</forename><surname>Collodi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><forename type="middle">Castellani</forename><surname>Pollidori</surname></persName>
		</author>
		<title level="m">Le avventure di Pinocchio</title>
				<meeting><address><addrLine>Pescia</addrLine></address></meeting>
		<imprint>
			<publisher>Fondazione nazionale Carlo Collodi</publisher>
			<date type="published" when="1983">1983</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b31">
	<monogr>
		<author>
			<persName><forename type="first">E</forename><surname>De Amicis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Tamburini</surname></persName>
		</author>
		<title level="m">Cuore. Libro per ragazzi</title>
		<imprint>
			<publisher>Einaudi</publisher>
			<pubPlace>Torino</pubPlace>
			<date type="published" when="1972">2018. 1972</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b32">
	<analytic>
		<title level="a" type="main">Proverbi, discorso e gesto proverbiale nei «Malavoglia»</title>
		<author>
			<persName><forename type="first">G</forename><forename type="middle">B</forename><surname>Bronzini</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">I Malavoglia. Atti del Congresso Internazionale di Studi</title>
		<meeting><address><addrLine>Catania, 26-28 novembre 1981</addrLine></address></meeting>
		<imprint>
			<publisher>Biblioteca della Fondazione Verga</publisher>
			<date type="published" when="1982">1982</date>
			<biblScope unit="page" from="637" to="684" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b33">
	<analytic>
		<title level="a" type="main">Il &apos;bianco di dialogato&apos; e il trattamento tipografico del discorso diretto</title>
		<author>
			<persName><forename type="first">E</forename><surname>Tonani</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Il romanzo in bianco e nero. Ricerche sull&apos;uso degli spazi bianchi e dell&apos;interpunzione nella narrativa italiana dall&apos;Ottocento a oggi</title>
		<editor>
			<persName><forename type="first">E</forename><surname>Tonani</surname></persName>
		</editor>
		<imprint>
			<publisher>Franco Cesati</publisher>
			<pubPlace>Firenze</pubPlace>
			<date type="published" when="2010">2010</date>
			<biblScope unit="page" from="103" to="136" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b34">
	<analytic>
		<title level="a" type="main">Pinocchio tra dialogo e scrittura</title>
		<author>
			<persName><forename type="first">R</forename><surname>Pellerey</surname></persName>
		</author>
		<ptr target="https://www.jstor.org/stable/26150287" />
	</analytic>
	<monogr>
		<title level="j">Belfagor</title>
		<imprint>
			<biblScope unit="volume">60</biblScope>
			<biblScope unit="page" from="267" to="284" />
			<date type="published" when="2005">2005</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b35">
	<analytic>
		<title level="a" type="main">Introduzione</title>
		<author>
			<persName><forename type="first">O</forename><surname>Castellani Pollidori</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Le avventure di Pinocchio</title>
		<author>
			<persName><forename type="first">C</forename><surname>Collodi</surname></persName>
		</author>
		<editor>
			<persName><forename type="first">O</forename><surname>Castellani Pollidori</surname></persName>
		</editor>
		<imprint>
			<publisher>Fondazione nazionale Carlo Collodi</publisher>
			<pubPlace>Pescia</pubPlace>
			<date type="published" when="1983">1983</date>
			<biblScope unit="page" from="XIII" to="LXXXIV" />
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
