<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Coherence Evaluation in Italian Language Models</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Marta</forename><surname>Sartor</surname></persName>
							<email>marta.sartor@ilc.cnr.it</email>
							<affiliation key="aff0">
								<orgName type="department">Istituto di Linguistica Computazionale &quot;A. Zampolli&quot;</orgName>
<orgName type="laboratory">ItaliaNLP Lab (ILC-CNR)</orgName>
								<address>
									<addrLine>via G. Moruzzi 1</addrLine>
									<settlement>Pisa</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
<persName><forename type="first">Felice</forename><surname>Dell'Orletta</surname></persName>
							<email>felice.dellorletta@ilc.cnr.it</email>
							<affiliation key="aff0">
								<orgName type="department">Istituto di Linguistica Computazionale &quot;A. Zampolli&quot;</orgName>
<orgName type="laboratory">ItaliaNLP Lab (ILC-CNR)</orgName>
								<address>
									<addrLine>via G. Moruzzi 1</addrLine>
									<settlement>Pisa</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Giulia</forename><surname>Venturi</surname></persName>
							<email>giulia.venturi@ilc.cnr.it</email>
							<affiliation key="aff0">
								<orgName type="department">Istituto di Linguistica Computazionale &quot;A. Zampolli&quot;</orgName>
<orgName type="laboratory">ItaliaNLP Lab (ILC-CNR)</orgName>
								<address>
									<addrLine>via G. Moruzzi 1</addrLine>
									<settlement>Pisa</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Coherence Evaluation in Italian Language Models</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">319C25C4DB294FDBE986DF0D1DAE75D3</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T16:37+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Italian LM</term>
					<term>coherence assessment</term>
					<term>perplexity</term>
					<term>inter-sentence semantic distance</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Coherence assessment is central to many NLP tasks, but its evaluation is complex and often done indirectly. In the LLM era, it is even more crucial to understand how, and how well, these models represent coherence. This study investigates the effectiveness of small Italian language models (under 1B parameters) in assessing coherence and focuses on what factors most influence their performance. Our analysis involves 15 Transformer-based LLMs differing in architecture, parameter size, and training data, and monitors different textual genres and perturbations used during dataset construction. Two coherence modeling strategies are tested: perplexity and inter-sentence semantic distance. We show that best practices vary significantly depending on model architecture and approach, but most importantly on what kind of texts they are applied to, highlighting the nuanced interaction between textual genre, data perturbation, and model performance.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>Coherence is the meaning connection that binds the components of a text <ref type="bibr" target="#b1">[2]</ref> and is fundamental to ensuring the effectiveness of every communicative act. Consequently, in computational linguistics, its analysis is crucial for the resolution of numerous tasks, from identifying the necessary information for question answering <ref type="bibr" target="#b2">[3]</ref> to recognizing pathological speech <ref type="bibr" target="#b3">[4,</ref><ref type="bibr" target="#b4">5]</ref>, from automatic readability assessment <ref type="bibr" target="#b5">[6]</ref> to automatic summary generation <ref type="bibr" target="#b6">[7]</ref>. Its critical importance has led to the development of a number of resources (e.g. <ref type="bibr" target="#b7">[8,</ref><ref type="bibr" target="#b8">9,</ref><ref type="bibr" target="#b9">10]</ref>); for Italian specifically, a new dataset annotated with human judgments of coherence has recently been released (DisCoTex, <ref type="bibr" target="#b10">[11]</ref>).</p><p>It is however notably complex to model coherence computationally, as it does not require explicit linguistic structures to be expressed: it is rather a psychological construct <ref type="bibr" target="#b11">[12]</ref>, reconstructed implicitly through inferences, general knowledge, co-text, and context <ref type="bibr" target="#b1">[2]</ref>. Moreover, its highly subjective nature <ref type="bibr" target="#b10">[11]</ref> makes coherence also difficult to assess and evaluate: the soundest approach, and the most direct, would be employing human evaluations, but such data is very costly and lengthy to collect. For this reason, the most common coherence evaluation strategies are by proxy, primarily through the order discrimination task. 
Its underlying assumption that shuffled texts are less coherent than the original, though sound <ref type="bibr" target="#b12">[13]</ref>, has shown its limits <ref type="bibr" target="#b13">[14,</ref><ref type="bibr" target="#b11">12,</ref><ref type="bibr" target="#b14">15,</ref><ref type="bibr" target="#b15">16]</ref> from the outset. However, its efficiency in terms of resources has encouraged several variations on the original task <ref type="bibr" target="#b16">[17]</ref>, ranging from altering the number and position of the shuffled sentences <ref type="bibr" target="#b14">[15,</ref><ref type="bibr" target="#b17">18]</ref> to replacing shuffling with substitutions from a closely related document <ref type="bibr" target="#b9">[10]</ref>.</p><p>Since the introduction of Transformer models and the paradigm shift brought about by Large Language Models, coherence modeling approaches have changed and shifted in that direction. Many works employ LMs, developing specific models through specialized training <ref type="bibr" target="#b18">[19,</ref><ref type="bibr" target="#b19">20]</ref> or leveraging pretrained models through new indirect approaches <ref type="bibr" target="#b20">[21,</ref><ref type="bibr" target="#b21">22]</ref>. A great deal of attention has since also been devoted to probing these models to more accurately evaluate their ability on coherence assessment <ref type="bibr" target="#b9">[10,</ref><ref type="bibr" target="#b22">23,</ref><ref type="bibr" target="#b17">18]</ref>.</p><p>In this study, we investigate how well small Italian language models assess coherence, monitoring three factors: • LLM characteristics; • textual genre of the target text; • textual perturbation applied during dataset construction.</p><p>The first point has a widely known impact, and we attempted, as far as available models allowed, to systematically monitor several components. To this end we selected 15 Transformer-based LLMs, all under 1 billion parameters, differing in architecture, parameter size, target language, and/or training data size. 
The literature is also quite clear on the impact that both textual genre and data perturbation can have, both in the training and the evaluation phase. Nonetheless, it is not always easy to monitor for these factors, especially due to resource availability. In order to take a deeper look at both these aspects, we chose to work on the DisCoTex <ref type="bibr" target="#b10">[11]</ref> dataset, which contains small paragraphs from two different genres (TEDx and Wikipedia) and with different degrees of perturbation at the intersentential level (none, inversion, substitution), where each instance is annotated with human judgments of coherence.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Methodology</head><p>We tested two different unsupervised approaches to modeling coherence: inter-sentence semantic distance and perplexity.</p><p>The first approach is a widely used technique to compare vectorial representations of meaning and was calculated on all models tested, regardless of architecture. The paragraph is first divided into sentences using the sentence-splitting feature of the Stanza tokenizer, version 1.5.0. Each sentence is tokenized and processed by the model, from which a single vector representation of the sentence is obtained: inter-sentence distance is then calculated between all pairs of consecutive sentences, and a global paragraph value is obtained through a statistical function. Sentence embeddings were calculated differently depending on the model's architecture: for sentence encoders, the direct output of the model was taken; for decoders, the last-layer representations of each token in the sentence were mean-pooled into a single vector; for encoder-decoders and encoders, the same process was applied to the encoder's last layer, and CLS was also tested as a possible sentence embedding for encoder models. Since this is a less straightforward methodology, at each step we tested several variants to broaden the analysis as much as possible:</p><p>• the measure of distance was calculated with both cosine and Euclidean distance; • for encoders, sentence embeddings were represented both through the CLS token and the mean-pooling of the sentence tokens; • paragraph values were pooled from inter-sentence values by different statistical functions, namely mean and standard deviation.</p><p>Our choice of statistical function fell firstly on the mean, as a global measure of semantic distance widely used and recognized in the literature. 
We chose to add standard deviation, a less commonly used function, due to the nature of the dataset, where the data is locally perturbed: we reasoned that a more local measure of variation might be suited to modeling such an alteration. We also tried modeling coherence in terms of textual plausibility by using perplexity: it is a global indicator and has already been successfully employed to this end <ref type="bibr" target="#b11">[12,</ref><ref type="bibr" target="#b17">18]</ref>. This method was applied to all decoders and also to two selected encoder models, which allowed a direct comparison with respect to the target-language strategy: on decoder models we calculated perplexity, while on encoder models we used a plausibility metric so as to be able to compare it with perplexity. Plausibility was calculated through masked language modeling, by masking each token of the paragraph one by one and averaging their likelihoods across the paragraph, as reported below, with n being the total number of tokens:</p><formula>plausibility(W) = (1/n) Σ_{i=1}^{n} P(w_i | W ∖ {w_i})</formula><p>In order to address the impact of textual genre and text perturbation, the analysis of the results was carried out at various levels of granularity: on the entire dataset, by source, and by perturbation within each source. Each model was evaluated based on the Spearman correlation of its results with human judgments. Additionally, the difference in distribution between classes (source of texts or type of perturbation) was assessed using the Wilcoxon T-test and rank-biserial correlation. Evaluating performance through correlation with human judgment, besides being a more straightforward and reliable approach, also allows us to effectively counterbalance the possible bias introduced by the fact that Wikipedia is present both in the pretraining data of most models and in our evaluation dataset.</p><p>Baselines were set as random values. Since perplexity and plausibility were both assimilated to a probability distribution, we generated random values between 0 and 1. 
For inter-sentence distance, the chosen measure of distance was calculated between as many random values as the average length in sentences of the dataset items, which is 4; the range in which we generated each value depended on the measure: -1 to 1 for cosine distance, and 0 to 1 for Euclidean distance (as if the values had been normalized, since Euclidean distance has no finite maximum).</p></div>
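The inter-sentence distance pipeline described above can be sketched in plain Python; this is a minimal illustration under our own naming, where the toy 2-d vectors stand in for real sentence embeddings (CLS tokens or mean-pooled last-layer representations of an LM):

```python
import math

def cosine_distance(u, v):
    # 1 - cosine similarity between two sentence vectors
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm

def euclidean_distance(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def paragraph_scores(sentence_embeddings, distance=cosine_distance):
    """Distance between each pair of consecutive sentences, pooled into
    global paragraph values via mean and standard deviation."""
    dists = [distance(u, v) for u, v in
             zip(sentence_embeddings, sentence_embeddings[1:])]
    mean = sum(dists) / len(dists)
    std = math.sqrt(sum((d - mean) ** 2 for d in dists) / len(dists))
    return {"mean": mean, "std": std}

# Toy usage with three fake 2-d sentence embeddings; in practice these
# come from the model, one vector per sentence of the paragraph.
embs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
scores = paragraph_scores(embs)
```

Lower mean distance between consecutive sentences is read as higher coherence, which is why a negative correlation with human Likert ratings is expected.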
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1.">Dataset</head><p>In this work, we used the dataset released by <ref type="bibr" target="#b10">[11]</ref>, recently integrated into a larger benchmark released for DisCoTex <ref type="bibr" target="#b23">[24]</ref>, a shared task on textual coherence analysis presented at the 8th evaluation campaign of NLP and speech tools for the Italian language (EVALITA 2023) <ref type="bibr" target="#b24">[25]</ref>. This dataset consists of 1064 instances, each corresponding to a paragraph of 4-5 sentences and annotated with human coherence judgments. The data is sourced either from the Italian Wikipedia or from the Italian section of the Multilingual TEDx dataset (which contains TEDx transcripts), to represent different linguistic varieties; the instances are balanced by source.</p><p>During dataset construction, about two-thirds of the instances were subjected to alterations that more or less significantly damaged the internal coherence of the paragraph, to test the effect of some common text perturbation strategies on human judgment. The alterations were either the inversion of any two sentences within the paragraph (inversion perturbation), or the replacement of a sentence in the paragraph with the tenth sentence from the end of the paragraph (substitution perturbation). The remaining third of the instances was instead left unaltered, to serve as a control group.</p><p>Each paragraph is annotated with human judgment values, corresponding to the mean and standard deviation of the ratings collected on each instance from at least 10 human annotators. The judgments were collected through crowdsourcing from native Italian speakers and are expressed on a Likert scale from 1 to 5, 1 being the lowest coherence score and 5 the highest. 
It must be noted, however, that source and perturbation type differentiate texts significantly on the basis of the distribution of the human coherence judgments they receive (see figures 1 and 2), to the point that the difference between distributions remains statistically significant even when differentiating instances by both source and perturbation type (see figure <ref type="figure" target="#fig_2">3</ref>). Indeed, the difference increases the heavier the perturbation applied to the instance, but the source has a much stronger impact than the perturbation. It is also worth noting that the value range in which most Wikipedia texts are located is far less sparse than that occupied by texts sourced from TEDx.</p><p>For this work, the dataset was integrated with additional data which was not available in the released version, namely the source and perturbation labels.</p></div>
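The two perturbation strategies used during dataset construction can be illustrated with a minimal sketch; this is a simplified rendering under our own naming, where the substitution source sentence is passed in explicitly rather than selected as in the original procedure:

```python
import random

def invert(sentences, rng):
    """Inversion perturbation: swap any two sentences of the paragraph."""
    out = list(sentences)
    i, j = rng.sample(range(len(out)), 2)  # two distinct positions
    out[i], out[j] = out[j], out[i]
    return out

def substitute(sentences, replacement, rng):
    """Substitution perturbation: replace one sentence of the paragraph
    with a sentence taken from outside it (passed in explicitly here)."""
    out = list(sentences)
    out[rng.randrange(len(out))] = replacement
    return out

# A toy 4-sentence paragraph, matching the average DisCoTex instance length.
para = ["S1.", "S2.", "S3.", "S4."]
inverted = invert(para, random.Random(0))
substituted = substitute(para, "Off-topic sentence.", random.Random(0))
```

Inversion preserves the paragraph's content while damaging its order, whereas substitution injects foreign content; this is why the two perturbations damage coherence to different degrees.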
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2.">Models</head><p>We tested 15 different Transformer-based models, covering the most common architectures (encoder, decoder, encoder-decoder, sentence encoder).</p><p>Most models are BERT-based, to allow better comparability on some of the variations we wanted to account for: we used a multilingual version (mBERT) and two monolingual Italian versions (BERT-ita and BERT-ita-xxl) differentiated by dataset size, as well as a sentence encoder (sBERT). Different architecture sizes were not tested because they were not available for the Italian versions of these models; uncased versions were not tested due to the nature of the Italian language and, most importantly, the developers' own recommendations, which indicated that the cased version performs better.</p><p>Table <ref type="table" target="#tab_0">1</ref> summarizes the models' characteristics (model, architecture, parameters, training data, language) with respect to our research questions: an Italian encoder<ref type="foot" target="#foot_4">6</ref>, 111M, 81GB; XLM-R base <ref type="bibr" target="#b27">[28]</ref>, encoder, 250M, 2.5T, multilingual; XLM-R large <ref type="bibr" target="#b27">[28]</ref>, encoder, 560M, 2.5T, multilingual; IT5 small <ref type="bibr" target="#b28">[29]</ref>, encoder-decoder, 60.5M, 215GB, Italian; IT5 base <ref type="bibr" target="#b28">[29]</ref>, encoder-decoder, 223M, 215GB, Italian; IT5 large <ref type="bibr" target="#b28">[29]</ref>, encoder-decoder, 770M, 215GB, Italian; GroGPT <ref type="bibr" target="#b29">[30]</ref>, decoder, 117M, 13.8GB, Italian; GePpeTto <ref type="bibr" target="#b30">[31]</ref>, decoder, 117M, 13.8GB, Italian; Minerva<ref type="foot" target="#foot_5">7</ref>, decoder, 350M, n.d., Italian. With "language" we refer here to the target language of the models, which does not always coincide with the language of the training data: for example, GroGPT is developed for Italian but is an English GPT-2 model whose lexical embeddings have been retrained for Italian.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Experimental Results</head><p>As previously stated, the coherence judgments expressed by people are on a Likert scale from 1 (not very coherent) to 5 (very coherent). Cosine and Euclidean distances, as well as perplexity, conversely express greater coherence when the score is lower, so a negative correlation is expected. Plausibility, on the other hand, has low scores for incoherent texts and high scores for coherent texts like the Likert scale, being a probability distribution; thus, the direction of the correlation is opposite to those of all other</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 2</head><p>Spearman correlation of human judgment labels with model predictions, on the entire dataset. Except for sentence encoders, the sentence pooling strategy is mean-pooling unless stated otherwise. The asterisk indicates p-value &lt; 0.05.</p><p>Baseline: mean eucl ↓ 0.01, mean cos ↓ -0.07, std eucl ↓ 0.03, std cos -0.03.</p><p>measures. In order to compare plausibility and perplexity, the sign of the correlation with plausibility has been inverted; we will henceforth refer to it as pseudo-perplexity. The analysis of results was performed on three different levels: on the entire dataset (sect. 3.1) and, to account for genre and perturbation differences, separating instances by their source (sect. 3.2) and, for each source, by their perturbation type (sect. 3.3).</p></div>
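The plausibility score underlying pseudo-perplexity can be sketched with a stand-in token scorer; `score_masked` below is a hypothetical stub for the masked-LM likelihood P(token | context), which in the experiments would come from an encoder model:

```python
def plausibility(tokens, score_masked):
    """Average masked-token likelihood over the paragraph: each token
    is masked in turn and scored given the rest of the context."""
    n = len(tokens)
    total = 0.0
    for i in range(n):
        context = tokens[:i] + ["[MASK]"] + tokens[i + 1:]
        total += score_masked(context, i, tokens[i])
    return total / n

# Deterministic toy scorer standing in for a masked LM: it assigns a
# high likelihood to every token except the implausible one.
def toy_scorer(context, position, token):
    return 0.1 if token == "banana" else 0.9

coherent = ["the", "cat", "sat"]
odd = ["the", "banana", "sat"]
```

Unlike perplexity, this averaged likelihood grows with coherence (like the Likert scale), which is why its correlation sign is inverted before comparison.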
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">Overall Analysis</head><p>The Spearman correlation of the models' predictions with human judgments of coherence is shown in tables 2 and 3, which cover approaches using inter-sentence distance and (pseudo)perplexity respectively. Coefficients marked with an asterisk are statistically significant. Results for each tested methodology and model are always above the baseline, except for the inter-sentence standard deviation of some models. Globally, the strongest correlation with human judgment is obtained by perplexity with Minerva (-0.46) and mBERT (-0.45), and by the average inter-sentence cosine distance calculated with sBERT (-0.45). Among the best are also the average cosine (-0.43) and Euclidean (-0.43) distances with LaBSE, and the average Euclidean distance with sBERT (-0.42).</p><p>Comparing the different functions of inter-sentence distance, standard deviation appears to be an unreliable approach: results often lacked correlation with human judgment or had very low coefficients compared to mean inter-sentence distance, which is also (almost) always statistically significant. Across all models, the standard deviation averages -0.08 and -0.11 correlation for Euclidean and cosine distance respectively, while the mean inter-sentence distance averages, respectively, -0.30 and -0.31. These results also highlight a different trend, namely that Euclidean distance generally yields weaker correlations than cosine distance. This difference is even more pronounced (-0.29 vs -0.32, -0.07 vs -0.11) when leaving out the one notable exception, Minerva, which despite great results with Euclidean distances has near-zero or non-significant correlation with cosine distance. The best approach however remains using perplexity or pseudo-perplexity (-0.38), whose average is much stronger than what the same models averaged when using mean inter-sentence distance (Euclidean: -0.27, cosine: -0.25). 
As regards sentence encodings, embeddings obtained with CLS are significantly worse, not only with respect to the corresponding mean-pooled embeddings (as already suggested by the literature: see e.g. <ref type="bibr" target="#b31">[32]</ref>) but also with respect to every other model. Among the different architectures, sentence encoders obtain overall the best results: they achieve the highest correlation scores, with great consistency among different models: their average mean inter-sentence distance coefficient is -0.40 for Euclidean distance and -0.41 for cosine distance, both higher than the average perplexity score. Their predictions have consistently lower correlation than other models (or no correlation at all) only when observing the standard deviation of inter-sentence distance. On the other hand IT5, which represents encoder-decoders, correlates well with human judgment both by mean and by standard deviation of inter-sentence distance, the latter being statistically significant and the highest among all models, though still rather low. The different IT5 versions have, with mean inter-sentence distance, lower correlation than sentence encoders, but still perform on par with the best encoder models. Overall, the correlation with human judgment of encoder results appears very variable. However, the high variability is due to the much lower performance obtained when using CLS, instead of mean-pooling, as a sentence representation method: this strategy results in little to no significant correlation with human judgments. Considering only encoders with the mean-pooling strategy, the average correlation with human judgment is fairly high, although inferior to that of sentence encoders and IT5. 
Decoders have the highest variability among results depending on the specific model: Minerva obtains the best results, but only using Euclidean distance, while methods employing cosine distance lead to non-significant results; GePpeTto has the lowest perplexity-based correlation scores but performs well with mean inter-sentence distance; lastly, GroGPT has a low but consistent performance with both mean and standard deviation of inter-sentence distance, yet has good perplexity scores.</p><p>As regards parameter size, it does not seem to influence results, either when comparing different sizes of the same model (for those where the comparison is possible) or when considering absolute parameter size: only with sentence encoders does the correlation increase in parallel with parameter size, and even then the difference is rather small. Training dataset size, on the other hand, consistently changes the performance from BERT-ita to BERT-ita-xxl, especially when using CLS.</p><p>It is instead unclear how, or whether, target language influences performance: multilingual encoders perform between BERT-ita and BERT-ita-xxl when considering inter-sentence distances, hinting at a possible advantage, and with (pseudo)perplexity multilingual BERT has a correlation score which is almost on par with that of the much bigger Minerva. In sentence encoders, however, the opposite seems to be true: sBERT performs comparably to the best multilingual sentence encoder, LaBSE, despite a considerable difference in parameter size.</p></div>
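The rank correlations reported throughout this section can be reproduced with a plain Spearman routine; this is a self-contained sketch of the standard definition (average ranks for ties, Pearson correlation of the rank vectors), not the exact library call used in the experiments:

```python
def _ranks(values):
    # 1-based average ranks, so ties receive the mean of their positions.
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average rank of the tied block
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on rank vectors."""
    rx, ry = _ranks(x), _ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx) ** 0.5
    vy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (vx * vy)

# Toy data: a perfectly monotonic decrease (e.g. perplexities against
# Likert coherence ratings) yields rho = -1.
humans = [4.8, 4.1, 3.0, 2.2, 1.5]
model = [10.0, 12.5, 20.0, 31.0, 55.0]
rho = spearman(humans, model)
```

This monotonic rank view is also why perplexity, cosine, and Euclidean scores, which all decrease with coherence, are expected to correlate negatively with the Likert ratings.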
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.">Analysis by source</head><p>Tables <ref type="table" target="#tab_3">4 and 5</ref> show the correlation coefficients of model predictions with human judgments when dividing the dataset by source. Confirming the results of <ref type="bibr" target="#b23">[24]</ref>, performance on the TEDx and Wikipedia sections is very different, with the former obtaining higher coefficients with all coherence assessment approaches (with the only exception of the mean inter-sentence cosine distance calculated with IT5 base). On Wikipedia, the correlation with human judgments is more likely to be non-significant, and its coefficient is stronger than -0.2 only with sentence encoders or through (pseudo)perplexity scores: excluding the standard deviation of inter-sentence distance, the average Wikipedia coefficient is -0.12, while for TEDx it is -0.24. This could be influenced by the fact that human coherence judgments on Wikipedia texts are more densely distributed in the upper (high-coherence) range, as exemplified in figure <ref type="figure" target="#fig_0">1</ref>; it is also worth noting that higher coherence values for Wikipedia texts were produced by almost all models regardless of approach, although the magnitude of this difference varied considerably between models. In line with what was observed on the entire dataset, sentence encoders remain the best performing architecture and (pseudo)perplexity the most effective coherence assessment approach. The unsuitability of the standard deviation of inter-sentence distance and of CLS as a sentence embedding is also further confirmed, with cosine distance still obtaining better results than Euclidean distance. 
Minerva also maintains the skewness in results between approaches using Euclidean and cosine distances.</p><p>The combination that achieves the highest correlation with human judgment is perplexity calculated with Minerva (-0.46 and -0.26 on TEDx and Wikipedia respectively), followed closely by pseudo-perplexity calculated with BERT-ita-xxl (-0.43, -0.23); the average inter-sentence cosine distance calculated with sBERT (-0.38, -0.24) also obtains satisfying results.</p><p>As we already observed, the standard deviation of inter-sentence distance is not a reliable coherence indicator: its correlation with human judgment is hardly ever significant, and when it is, it is only significant on one of the two classes and with very low coefficients. Perplexity and pseudo-perplexity, on the other hand, with the sole exception of GePpeTto, obtain much higher correlation on TEDx than any other approach and keep consistently high coefficients on Wikipedia, where most other performances falter. Besides GePpeTto, perplexity results vary significantly between models on TEDx (from -0.32 to -0.46) but are mostly identical (from -0.23 to -0.26) on Wikipedia; moreover, the best perplexity results gain a noticeable margin (0.08) over those of inter-sentence distances on TEDx, while those on Wikipedia improve by only 0.02. This reduced improvement brought about by perplexity, together with the overall lower performance on the Wikipedia section, supports our claim that its presence in most training datasets is offset by using human coherence judgments, and not perturbation labels, for the evaluation.</p><p>Sentence encoders remain the best performing architecture, not only for their higher correlation coefficients on the TEDx section but also and especially for their performance on the Wikipedia section, which is always significant and higher on average than that of any other architecture. 
As was the case on the overall dataset, their average score (-0.35 for TEDx and -0.19 for Wikipedia) is close to the average perplexity (-0.36 and -0.21 respectively), although this time slightly lower. Sentence encoders also appear in general much less sensitive than other architectures to the type of distance used, except for sBERT. Encoders maintain a certain variability depending on the model and are still comparable to decoders when using perplexity, but this time with inter-sentence distances they generally perform better than IT5, for which the Wikipedia class always has low to no correlation with human judgment. Decoders, on the other hand, have consistently low performances with inter-sentence distance, in contrast with what was previously observed; the sole exception is GePpeTto when leveraging the mean cosine distance: once more its pattern reverses, as it leads on inter-sentence distance while obtaining the lowest perplexity-based scores.</p><p>As regards parameter and training data size, no significant differences were observed. The impact of the target language remains, however, unclear: overall there seems to be no clear preference for either, and the direct comparison between mBERT and BERT-ita/BERT-ita-xxl shows the former outperforming the latter on inter-sentence distance tasks and the opposite on pseudo-perplexity.</p></div>
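The class-distribution comparison from the methodology (effect size via rank-biserial correlation) can be sketched as follows; this is the simple pairwise, Mann-Whitney-style formulation, a minimal rendering of our own rather than the exact routine used in the paper:

```python
def rank_biserial(a, b):
    """Rank-biserial correlation between two groups of scores: the
    proportion of cross-group pairs where a is higher minus the
    proportion where b is higher. +1 means a always wins, -1 never."""
    greater = sum(1 for x in a for y in b if x > y)
    smaller = sum(1 for x in a for y in b if x < y)
    return (greater - smaller) / (len(a) * len(b))

# Toy example: comparing model coherence scores on two classes of texts
# (values are illustrative, not taken from the paper's results).
tedx_scores = [0.2, 0.4, 0.5]
wiki_scores = [0.6, 0.7]
effect = rank_biserial(tedx_scores, wiki_scores)
```

An effect near ±1 indicates that one class (e.g. Wikipedia) almost always receives higher scores than the other, mirroring the distributional gap between the two sources discussed above.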
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.">Analysis by source and perturbation</head><p>As we already observed, the huge differences between TEDx and Wikipedia (both in terms of human judgment and model behavior) are such that the impact of different kinds of perturbation can only be observed by keeping the two genres separate. The impact of the aforementioned difference is also very clearly visible at this level of analysis: correlation results exhibit strong differences between the TEDx and Wikipedia sections, both in terms of significance and class distribution. Not only do the TEDx section results show higher correlation coefficients, but in the Wikipedia section most correlations are not significant. Furthermore, the perturbation class with the highest correlation with human judgment for TEDx, namely the inversion class, is never significant in the Wikipedia section except when using (pseudo)perplexity. It is worth noting not only that the ranking of the perturbation classes in terms of performance differs between TEDx and Wikipedia, and for Wikipedia between inter-sentence distance-based methods and perplexity, but also that these differences are not aligned with inter-annotator agreement. The only common factor between the two sections is the effectiveness of pseudo-perplexity and sentence encoders and the ineffectiveness of the standard deviation of inter-sentence distance. Due to the significant differences, the two sections are treated separately. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.1.">TEDx</head><p>The correlation between models' predictions and human judgment in the TEDx section is shown in Tables 6 and 7, for inter-sentence distance approaches and (pseudo)perplexity respectively. Among the perturbation classes, the inversion class generally has the strongest correlation with human judgment, typically followed by the class without alterations and then by the substitution class. Correlation is generally statistically significant, except for the standard deviation of inter-sentence distance measures, which, as we already observed, is an unreliable approach. The only other cases of non-significant coefficients occur with mean inter-sentence distance, mainly in the substitution class and only in a few cases in the unaltered class, mostly when using CLS embeddings as sentence representations.</p><p>The highest correlation with human judgment was obtained by Minerva's perplexity (-0.45 unaltered, -0.51 inversion, -0.39 substitution), followed by BERT-ita-xxl's pseudo-perplexity (-0.45, -0.45, -0.33) and LABSE's mean inter-sentence cosine distance (-0.35, -0.42, -0.25).</p><p>As always, perplexity was the approach with the highest correlation with human judgment, although with considerable variability across models: it averaged -0.35 for the unaltered class, -0.39 for the inversion class, and -0.30 for the substitution class. Also in line with previous observations, sentence encoders remain the best-performing architecture, obtaining consistently high results and averaging -0.34, -0.36, and -0.23 for the unaltered, inversion, and substitution classes respectively, not far from the perplexity scores. The role of the model language (multilingual or Italian) remains, however, unclear, following the same patterns observed in the dataset divided by source.</p><p>Generally speaking, this level of analysis is mostly coherent with the previous ones. Some differences concern the role of parameter and training data size. While parameter size still does not seem relevant in absolute terms, at this level it has a positive impact when comparing different sizes of MUSE and XLM-R (although it seems almost counterproductive for IT5). Similarly, training data size improves performance from BERT-ita to BERT-ita-xxl (especially with mean-pooling), but does not seem to influence other models. The difference between Euclidean and cosine distances is also reduced.</p></div>
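The distance-based scoring discussed throughout this analysis can be sketched as follows: each paragraph is reduced to the mean (and standard deviation) of the distances between consecutive sentence embeddings, and the resulting scores are correlated with human judgments via Spearman's rho. This is a minimal illustration assuming pre-computed sentence embeddings; the data and ratings below are synthetic placeholders, not values from the paper.

```python
import numpy as np
from scipy.stats import spearmanr

def intersentence_scores(emb):
    """Mean and standard deviation of cosine distances between
    consecutive sentence embeddings in one paragraph."""
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    cos_dist = 1.0 - np.sum(emb[:-1] * emb[1:], axis=1)
    return cos_dist.mean(), cos_dist.std()

# Toy data: four paragraphs of five sentences each, plus mean human
# coherence ratings (all synthetic, for illustration only).
rng = np.random.default_rng(0)
paragraphs = [rng.normal(size=(5, 8)) for _ in range(4)]
human = [4.1, 3.2, 2.5, 1.8]

means = [intersentence_scores(p)[0] for p in paragraphs]
# A negative rho would mean larger inter-sentence distances align
# with lower human coherence judgments.
rho, pvalue = spearmanr(means, human)
```

The standard-deviation variant simply takes the second return value instead of the first; as the results above suggest, it correlates far less reliably with human judgment.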
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.2.">Wikipedia</head><p>Tables <ref type="table">8 and 9</ref> show the correlation between the models' predictions and the human coherence judgments. The most interesting results concern the perturbation classes: the performance ranking of the different classes differs not only from that of TEDx, but also between approaches based on inter-sentence distance and on (pseudo)perplexity. In the former case, the substitution class has the highest correlation with human judgment, while the inversion class performs the worst, never being statistically significant; when using (pseudo)perplexity, on the other hand, the inversion class correlates the most with human judgment, followed by the substitution class. Upon further inspection, for the substitution and unaltered classes there is not much difference between the average performance using (pseudo)perplexity (-0.15 and -0.22 respectively) and mean inter-sentence distance (-0.14 and -0.17, excluding outliers like CLS embeddings and cosine Minerva), especially when considering cosine distance (-0.15 and -0.20). What changes radically is performance on the inversion class, which goes from no correlation to -0.25.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 6</head><p>Spearman correlation between human judgments and unsupervised methodologies tested on pretrained models. Results on the TEDx section of the dataset, grouped by perturbation type. Except for sentence encoders, the sentence pooling strategy is mean-pooling unless stated otherwise. The asterisk indicates p-value &lt; 0,05.
MEAN ↓ STD ↓
no swap sub no swap sub
-0,01 *-0,24 -0,05 -0,05 0,00 0,05
BERT-ita eucl *-0,19 *-0,16 *-0,16 -0,06 0,04 0,03
BERT-ita cos *-0,24 *-0,25 *-0,17 -0,10 -0,01 0,03
BERT-ita-xxl eucl *-0,26 *-0,32 -0,14 -0,12 -0,01 0,05
BERT-ita-xxl cos *-0,24 *-0,34 -0,14 -0,09 -0,04 0,03
XLM-R base eucl *-0,19 *-0,29 -0,05 *-0,17 -0,11 0,03
XLM-R base cos *-0,19 *-0,27 -0,04 *-0,17 *-0,15 0,01
XLM-R large eucl *-0,22 *-0,32 -0,13 *-0,25 *-0,16 -0,03
XLM-R large cos *-0,23 *-0,31 -0,11 *-0,27 *-0,20 -0,05
IT5 small eucl *-0,27 *-0,36 *-0,21 *-0,15 -0,06 -0,01
IT5 small cos *-0,26 *-0,32 -0,15 *-0,17 -0,04 -0,01
IT5 base eucl -0,14 *-0,38 -0,10 0,04 *-0,16 -0,15
IT5 base cos *-0,21 *-0,40 -0,06 -0,02 -0,14 -0,09
IT5 large eucl *-0,17 *-0,34 *-0,17 -0,03 *-0,18 -0,08
IT5 large cos *-0,22 *-0,35 *-0,18 0,03 -0,07 0,00
GroGPT eucl -0,02 *-0,18 -0,03 -0,13 -0,11 -0,02
GroGPT cos -0,02 -0,12 -0,09 0,03 -0,09 -0,03
GePpeTto eucl *-0,20 *-0,16 -0,10 -0,10 -0,07 0,02
GePpeTto cos *-0,26 *-0,22 -0,12 -0,13 -0,11 -0,01
Minerva eucl *-0,19 *-0,29 -0,10 *-0,17 *-0,22 -0,05
Minerva cos 0,01 0,03 0,05 -0,04 -0,14 0,11</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>There is an overall drop in performance, with an increased number of results that are not statistically significant. Comparing the different approaches, the pattern is the same as at the other levels of analysis: standard deviation performs the worst, as it is almost never significant, while perplexity performs the best, especially since it is the only methodology for which all three classes are statistically significant. Moreover, with (pseudo)perplexity all models (except for GePpeTto) always obtain statistically significant results, while with mean inter-sentence distance only about half of the results in the unaltered and substitution classes are statistically significant.</p><p>The highest correlation scores are obtained by Minerva with perplexity (-0.18 unaltered, -0.32 inversion, -0.26 substitution), mBERT with pseudo-perplexity (-0.21, -0.28, and -0.25 respectively), and sBERT with mean inter-sentence cosine distance (-0.28, -0.07, -0.34).</p><p>Results are in line with what we observed on the overall dataset and when considering sources separately; sentence encoders, in particular, are the only models that manage to obtain two statistically significant classes with distance-based approaches. There is only a slight difference in what concerns the impact of language. When directly comparing mBERT with BERT-ita and BERT-ita-xxl, the former performs better both with inter-sentence distance measures and with pseudo-perplexity; overall, multilingual models seem to perform better (with the exception of sBERT among sentence encoders).</p></div>
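The (pseudo)perplexity scores compared above follow the standard formulation: for decoders, perplexity is the exponential of the average negative log-probability of each token given its left context, while pseudo-perplexity for encoders replaces this with the masked-LM probability of each token given the rest of the text. A minimal sketch, assuming the per-token log-probabilities have already been obtained from a model (the values below are hypothetical):

```python
import math

def pseudo_perplexity(token_logprobs):
    """(Pseudo-)perplexity from per-token log-probabilities:
    exp of the negative mean log-probability. For decoders each
    term is log P(w_i | w_<i); for encoders it is the masked-LM
    log P(w_i | rest of the text)."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# A coherent paragraph should yield higher token probabilities,
# hence a lower (pseudo-)perplexity, than a perturbed one.
coherent  = [-1.2, -0.8, -1.0, -0.9]   # hypothetical log-probs
perturbed = [-2.5, -1.9, -2.2, -2.8]
assert pseudo_perplexity(coherent) < pseudo_perplexity(perturbed)
```

Since lower perplexity is taken to signal higher coherence, a well-behaved model yields a negative Spearman correlation with human coherence ratings, as in the tables.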
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Conclusions</head><p>We evaluated the coherence assessment abilities of 15 small Italian language models, which varied in their structural and training-related characteristics, using two unsupervised approaches: modeling coherence via inter-sentence semantic distance and via perplexity. We evaluated results by their correlation with human judgments of coherence and analysed our dataset at different levels, to monitor differences related to the genre of the target text and the perturbation it was subjected to. Perplexity and pseudo-perplexity consistently obtain the highest correlation with human judgments and seem to be the most effective coherence assessment methods. Among the distance measures, the correlations obtained with sentence encoders were comparable to those of (pseudo)perplexity. Cosine distance appeared to be slightly better than Euclidean distance, while sentence embeddings derived from the CLS token and the standard deviation of a paragraph's inter-sentence distances proved to be unsuitable. With perplexity and pseudo-perplexity, the single most impactful choice seemed to be the model itself, regardless of parameter size or architecture; conversely, architecture was the most influential factor with inter-sentence distance approaches, with sentence encoders obtaining by far the best results. This was shown not only by higher correlation coefficients but also by the very close range of values produced by the models, underlining the reliability of the approach. Model and training set size did not seem to have much influence on performance, while the model language (multilingual or Italian) yielded contradictory results.</p><p>Textual genre was shown to heavily influence model performance, both quantitatively and qualitatively, with TEDx always obtaining much higher correlation coefficients than Wikipedia.
It is unlikely that these results were influenced by the presence of Wikipedia in the training data, given both the lower results on that section and the fact that the evaluation is against human judgments. They could instead be influenced by the wider range of values in the human judgments registered on TEDx, which aids a ranking-based correlation measure; this underlines the relevance of considering genre in performance evaluation.</p><p>The impact of the different sources is also clear when the effect of the perturbations is analyzed. Perturbation classes not only exhibited markedly different behavior, but also produced different results depending on the source of the paragraph. The clearest example is the inversion class, which performed best on TEDx, while on Wikipedia it obtained good results with (pseudo)perplexity but was never statistically significant with distance measures. Inversions affect order, which is more easily picked up by a sequence-based metric like perplexity than by a semantically rooted distance measure. Both were sufficient to detect alterations on TEDx, but only the former was effective on Wikipedia due to its higher thematic coherence, highlighting the importance of considering perturbations both in isolation and in their interaction with other textual characteristics.</p></div>
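The contrast between order-sensitive and semantically rooted measures can be illustrated with a toy example: in a thematically tight paragraph (as is typical of Wikipedia articles), reordering sentences barely moves an adjacent-distance score, while substituting an off-topic sentence moves it sharply. A sketch on synthetic embeddings (all vectors are illustrative, not drawn from any model):

```python
import numpy as np

def mean_adjacent_cos_dist(emb):
    """Mean cosine distance between consecutive sentence embeddings."""
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    return float(np.mean(1.0 - np.sum(emb[:-1] * emb[1:], axis=1)))

rng = np.random.default_rng(0)
# A thematically tight paragraph: five near-identical sentence vectors.
para = np.ones((5, 8)) + rng.normal(scale=0.01, size=(5, 8))

swapped = para.copy()
swapped[[1, 3]] = swapped[[3, 1]]      # inversion: reorder two sentences

substituted = para.copy()
substituted[2] = -np.ones(8)           # substitution: off-topic sentence

base = mean_adjacent_cos_dist(para)
# Reordering near-identical sentences barely moves the distance score,
# while the off-topic substitution shifts it sharply.
assert abs(mean_adjacent_cos_dist(swapped) - base) < \
       mean_adjacent_cos_dist(substituted) - base
```

A sequence-based score like perplexity, by contrast, conditions each token on its ordered context, so it can still react to the inversion even when the sentence embeddings are nearly interchangeable.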
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 8</head><p>Spearman correlation between human judgments and unsupervised methodologies tested on pretrained models. Results on the Wikipedia section of the dataset, grouped by perturbation type. Except for sentence encoders, the sentence pooling strategy is mean-pooling unless stated otherwise. The asterisk indicates p-value &lt; 0,05.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>MEAN ↓ STD ↓</head><p>no swap sub no swap sub
LABSE eucl *-0,22 -0,08 *-0,29 0,11 0,01 0,05
LABSE cos *-0,21 -0,08 *-0,29 0,08 0,01 0,03
MUSE eucl *-0,15 -0,09 *-0,24 0,07 -0,05 0,11
MUSE cos *-0,15 -0,09 *-0,24 0,04 -0,07 0,07
MUSE large eucl *-0,15 -0,05 *-0,27 0,13 -0,04 0,01
MUSE large cos -0,14 -0,05 *-0,27 0,12 -0,04 -0,01
sBERT eucl *-0,28 -0,06 *-0,26 0,10 0,03 0,09
sBERT cos *-0,28 -0,07 *-0,34 0,09 0,00 0,07
mBERT CLS eucl -0,10 0,02 *-0,17 -0,11 0,03 -0,05
mBERT CLS cos -0,12 0,00 *-0,19 -0,13 0,00 -0,07
mBERT eucl -0,11 0,06 *-0,16 0,01 0,02 0,03
mBERT cos *-0,16 -0,02 *-0,26 -0,05 -0,04 0,02
BERT-ita CLS eucl -0,03 0,02 -0,01 -0,09 0,12 -0,07
BERT-ita CLS cos -0,08 0,01 -0,10 -0,12 0,07 *-0,18
BERT-ita-xxl CLS eucl *-0,20 0,00 -0,03 -0,12 -0,01 -0,05
BERT-ita-xxl CLS cos *-0,21 0,00 -0,04 -0,14 -0,03 -0,10
BERT-ita eucl -0,10 0,00 -0,09 0,01 0,07 0,11
BERT-ita cos -0,11 0,01 *-0,16 0,03 0,04 0,02
BERT-ita-xxl eucl -0,14 0,06 -0,13 -0,07 0,07 -0,03
BERT-ita-xxl cos -0,13 0,02 *-0,20 -0,04 0,04 -0,04
XLM-R base eucl *-0,16 0,01 *-0,17 -0,04 -0,03 -0,12
XLM-R base cos *-0,16 0,01 *-0,16 -0,06 -0,03 -0,14
XLM-R large eucl *-0,15 0,00 -0,10 -0,04 0,00 0,04
XLM-R large cos *-0,16 0,00 -0,09 -0,06 0,01 0,01
IT5 small eucl -0,12 0,00 -0,11 -0,06 0,04 -0,13
IT5 small cos -0,11 -0,01 -0,15 -0,07 0,02 -0,09
IT5 base eucl -0,14 -0,03 -0,14 -0,09 0,03 *-0,22
IT5 base cos *-0,15 -0,05 *-0,16 -0,12 -0,03 *-0,23
IT5 large eucl -0,06 0,03 *-0,16 -0,03 0,06 -0,13
IT5 large cos -0,12 -0,01 *-0,17 -0,04 0,01 -0,09
GroGPT eucl 0,03 -0,10 0,09 0,04 0,01 0,05
GroGPT cos -0,02 -0,05 -0,04 0,04 -0,05 -0,05
GePpeTto eucl -0,05 0,09 -0,13 0,01 0,02 0,03
GePpeTto cos *-0,16 -0,04 *-0,25 -0,04 -0,02 0,00
Minerva eucl -0,14 0,03 -0,12 -0,09 0,01 -0,14
Minerva cos 0,04 0,05 -0,13 -0,07 0,09 -0,13</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 9</head><p>Spearman correlation between (pseudo)perplexity and human judgment labels, by perturbation type on texts sourced from Wikipedia. The asterisk indicates p-value &lt; 0,05. </p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: Mean coherence value distribution on the basis of textual genre.</figDesc><graphic coords="4,72.00,65.61,145.76,110.77" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: Mean coherence value distribution on the basis of perturbation type.</figDesc><graphic coords="4,224.76,65.61,145.76,110.77" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Figure 3 :</head><label>3</label><figDesc>Figure 3: Mean coherence value distribution on the basis of genre and perturbation type.</figDesc><graphic coords="4,377.51,65.61,145.76,110.77" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_4"><head></head><label></label><figDesc>BERT-ita-xxl *-0,17 *-0,28 *-0,19
GroGPT *-0,19 *-0,26 *-0,29
GePpeTto -0,02 -0,09 -0,12
Minerva *-0,18 *-0,32 *-0,26</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1</head><label>1</label><figDesc>Descriptive table of the models tested, highlighting the dimensions that we aim to compare. n.d. means that the training data size was not declared.</figDesc><table><row><cell></cell><cell cols="3">architecture parameter size training data (size)</cell><cell>language</cell></row><row><cell>LABSE [26]</cell><cell>sentence encoder</cell><cell>470M</cell><cell cols="2">n.d. multilingual</cell></row><row><cell>MUSE [27]</cell><cell>sentence encoder</cell><cell>69M</cell><cell cols="2">n.d. multilingual</cell></row><row><cell>MUSE large 2</cell><cell>sentence encoder</cell><cell>85M</cell><cell cols="2">n.d. multilingual</cell></row><row><cell>sBERT 3</cell><cell>sentence encoder</cell><cell>111M</cell><cell>n.d.</cell><cell>italian</cell></row><row><cell>mBERT 4</cell><cell>encoder</cell><cell>179M</cell><cell cols="2">n.d. multilingual</cell></row><row><cell>BERT-ita 5</cell><cell>encoder</cell><cell>111M</cell><cell>13GB</cell><cell>italian</cell></row><row><cell>BERT-ita-xxl</cell><cell></cell><cell></cell><cell></cell><cell></cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 3</head><label>3</label><figDesc>Spearman correlation between (pseudo)perplexity and human judgment labels, on the entire dataset. The asterisk indicates p-value &lt; 0,05.</figDesc><table><row><cell></cell><cell>(P)PPL ↓</cell></row><row><cell>baseline</cell><cell>-0,06</cell></row><row><cell>mBERT</cell><cell>*-0,45</cell></row><row><cell>BERT-ita-xxl</cell><cell>*-0,35</cell></row><row><cell>GroGPT</cell><cell>*-0,39</cell></row><row><cell>GePpeTto</cell><cell>*-0,27</cell></row><row><cell>Minerva</cell><cell>*-0,46</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 4</head><label>4</label><figDesc>Spearman correlation of human judgment labels with model predictions, dividing the dataset by source. Except for sentence encoders, the sentence pooling strategy is mean-pooling unless stated otherwise. The asterisk indicates p-value &lt; 0,05.</figDesc><table><row><cell></cell><cell cols="2">MEAN ↓</cell><cell cols="2">STD ↓</cell></row><row><cell></cell><cell>TED</cell><cell>WIKI</cell><cell cols="2">TED WIKI</cell></row><row><cell>LABSE eucl</cell><cell cols="2">*-0,36 *-0,21</cell><cell>0,05</cell><cell>0,07</cell></row><row><cell>LABSE cos</cell><cell cols="2">*-0,37 *-0,20</cell><cell>0,02</cell><cell>0,04</cell></row><row><cell>MUSE eucl</cell><cell cols="3">*-0,33 *-0,18 *0,14</cell><cell>0,05</cell></row><row><cell>MUSE cos</cell><cell cols="3">*-0,33 *-0,17 *0,12</cell><cell>0,02</cell></row><row><cell>MUSE large eucl</cell><cell cols="3">*-0,37 *-0,18 *0,14</cell><cell>0,04</cell></row><row><cell>MUSE large cos</cell><cell cols="3">*-0,37 *-0,18 *0,11</cell><cell>0,02</cell></row><row><cell>sBERT eucl</cell><cell cols="2">*-0,32 *-0,19</cell><cell>0,06</cell><cell>0,07</cell></row><row><cell>sBERT cos</cell><cell cols="2">*-0,38 *-0,24</cell><cell>0,03</cell><cell>0,04</cell></row><row><cell>mBERT CLS eucl</cell><cell cols="2">*-0,14 *-0,10</cell><cell>0,01</cell><cell>-0,03</cell></row><row><cell>mBERT CLS cos</cell><cell cols="2">*-0,13 *-0,12</cell><cell>-0,03</cell><cell>-0,06</cell></row><row><cell>mBERT eucl</cell><cell cols="2">*-0,24 *-0,10</cell><cell>-0,05</cell><cell>0,01</cell></row><row><cell>mBERT cos</cell><cell cols="2">*-0,28 *-0,16</cell><cell>-0,08</cell><cell>-0,03</cell></row><row><cell>BERT-ita CLS eucl</cell><cell>-0,06</cell><cell>-0,01</cell><cell>-0,06</cell><cell>-0,02</cell></row><row><cell>BERT-ita CLS cos</cell><cell>*-0,13</cell><cell>-0,06</cell><cell cols="2">*-0,11 -0,07</cell></row><row><cell 
cols="2">BERT-ita-xxl CLS eucl *-0,09</cell><cell>-0,08</cell><cell>0,04</cell><cell>-0,05</cell></row><row><cell>BERT-ita-xxl CLS cos</cell><cell>*-0,11</cell><cell>-0,08</cell><cell>0,01</cell><cell>-0,08</cell></row><row><cell>BERT-ita eucl</cell><cell cols="2">*-0,17 *-0,09</cell><cell>0,02</cell><cell>0,05</cell></row><row><cell>BERT-ita cos</cell><cell cols="2">*-0,23 *-0,11</cell><cell>0,00</cell><cell>0,00</cell></row><row><cell>BERT-ita-xxl eucl</cell><cell>*-0,25</cell><cell>-0,08</cell><cell>-0,03</cell><cell>-0,04</cell></row><row><cell>BERT-ita-xxl cos</cell><cell cols="2">*-0,26 *-0,12</cell><cell>-0,04</cell><cell>-0,04</cell></row><row><cell>XLM-R base eucl</cell><cell cols="2">*-0,19 *-0,11</cell><cell>-0,08</cell><cell>-0,06</cell></row><row><cell>XLM-R base cos</cell><cell cols="4">*-0,18 *-0,12 *-0,10 -0,07</cell></row><row><cell>XLM-R large eucl</cell><cell cols="4">*-0,23 *-0,10 *-0,13 -0,02</cell></row><row><cell>XLM-R large cos</cell><cell cols="4">*-0,22 *-0,10 *-0,16 -0,04</cell></row><row><cell>IT5 small eucl</cell><cell>*-0,28</cell><cell>-0,07</cell><cell>-0,08</cell><cell>-0,06</cell></row><row><cell>IT5 small cos</cell><cell cols="2">*-0,25 *-0,09</cell><cell>-0,07</cell><cell>-0,06</cell></row><row><cell>IT5 base eucl</cell><cell cols="2">*-0,21 *-0,09</cell><cell>-0,08</cell><cell>-0,08</cell></row><row><cell>IT5 base cos</cell><cell cols="2">*-0,23 *-0,11</cell><cell cols="2">-0,07 *-0,12</cell></row><row><cell>IT5 large eucl</cell><cell>*-0,23</cell><cell>-0,06</cell><cell cols="2">*-0,09 -0,04</cell></row><row><cell>IT5 large cos</cell><cell cols="2">*-0,25 *-0,10</cell><cell>0,00</cell><cell>-0,07</cell></row><row><cell>GroGPT eucl</cell><cell>-0,07</cell><cell>-0,01</cell><cell>-0,07</cell><cell>0,02</cell></row><row><cell>GroGPT cos</cell><cell>-0,07</cell><cell>-0,05</cell><cell>-0,02</cell><cell>-0,03</cell></row><row><cell>GePpeTto 
eucl</cell><cell>*-0,20</cell><cell>-0,05</cell><cell>-0,08</cell><cell>0,00</cell></row><row><cell>GePpeTto cos</cell><cell cols="4">*-0,24 *-0,16 *-0,12 -0,03</cell></row><row><cell>Minerva eucl</cell><cell>*-0,20</cell><cell>-0,06</cell><cell cols="2">*-0,16 -0,06</cell></row><row><cell>Minerva cos</cell><cell>0,00</cell><cell>-0,03</cell><cell>-0,05</cell><cell>-0,04</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_3"><head>Table 5</head><label>5</label><figDesc>Spearman correlation between (pseudo)perplexity and human judgment labels, by source. The asterisk indicates p-value &lt; 0,05.</figDesc><table /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_5"><head>Table 7</head><label>7</label><figDesc>Spearman correlation between (pseudo)perplexity and human judgment labels, by perturbation type on texts sourced from TEDx. The asterisk indicates coefficients with p-value &lt; 0,05.</figDesc><table /></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_0">https://tfhub.dev/google/universal-sentence-encoder-multilingual-large/3</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_1">https://huggingface.co/nickprock/sentence-bert-base-italian-xxl-uncased</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_2">https://github.com/google-research/bert/blob/master/multilingual.md</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="5" xml:id="foot_3">https://huggingface.co/dbmdz/bert-base-italian-cased</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="6" xml:id="foot_4">https://huggingface.co/dbmdz/bert-base-italian-xxl-cased</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="7" xml:id="foot_5">https://huggingface.co/sapienzanlp/Minerva-350M-base-v1.0</note>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Acknowledgments</head><p>This paper is supported by the PRIN 2022 PNRR Project P20227PEPK (EKEEL -Empowering Knowledge Extraction to Empower Learners), funded by the European Union -Next Generation EU, and the LuCET -LingUistic Complexity Evaluation in educaTion -project under the PRIN grant no. 2022KPNY3B funded by the Italian Ministry of University and Research.</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Preface to the Eighth Workshop on Natural Language for Artificial Intelligence (NL4AI)</title>
		<author>
			<persName><forename type="first">G</forename><surname>Bonetta</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">D</forename><surname>Hromei</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Siciliani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">A</forename><surname>Stranisci</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Eighth Workshop on Natural Language for Artificial Intelligence (NL4AI 2024) co-located with 23rd International Conference of the Italian Association for Artificial Intelligence (AI*IA 2024)</title>
				<meeting>the Eighth Workshop on Natural Language for Artificial Intelligence (NL4AI 2024) co-located with 23rd International Conference of the Italian Association for Artificial Intelligence (AI*IA 2024)</meeting>
		<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<monogr>
		<author>
			<persName><forename type="first">G</forename><forename type="middle">L</forename><surname>Beccaria</surname></persName>
		</author>
		<title level="m">Dizionario di linguistica e di filologia, metrica, retorica</title>
				<meeting><address><addrLine>Einaudi</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2004">2004</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Evaluating discourse-based answer extraction for why-question answering</title>
		<author>
			<persName><forename type="first">S</forename><surname>Verberne</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Boves</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Oostdijk</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P.-A</forename><surname>Coppen</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval</title>
				<meeting>the 30th annual international ACM SIGIR conference on Research and development in information retrieval</meeting>
		<imprint>
			<date type="published" when="2007">2007</date>
			<biblScope unit="page" from="735" to="736" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Quantifying incoherence in speech: an automated methodology and novel application to schizophrenia</title>
		<author>
			<persName><forename type="first">B</forename><surname>Elvevåg</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">W</forename><surname>Foltz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">R</forename><surname>Weinberger</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><forename type="middle">E</forename><surname>Goldberg</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Schizophrenia research</title>
		<imprint>
			<biblScope unit="volume">93</biblScope>
			<biblScope unit="page" from="304" to="316" />
			<date type="published" when="2007">2007</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Automatic detection of incoherent speech for diagnosing schizophrenia</title>
		<author>
			<persName><forename type="first">D</forename><surname>Iter</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Yoon</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Jurafsky</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Fifth Workshop on Computational Linguistics and Clinical Psychology: From Keyboard to Clinic</title>
				<meeting>the Fifth Workshop on Computational Linguistics and Clinical Psychology: From Keyboard to Clinic</meeting>
		<imprint>
			<date type="published" when="2018">2018</date>
			<biblScope unit="page" from="136" to="146" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">A neural local coherence analysis model for clarity text scoring</title>
		<author>
			<persName><forename type="first">P</forename><surname>Muangkammuen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Fukumoto</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><forename type="middle">R</forename><surname>Saikaew</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Li</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 28th international conference on computational linguistics</title>
				<meeting>the 28th international conference on computational linguistics</meeting>
		<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="2138" to="2143" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Abstractive summarization of product reviews using discourse structure</title>
		<author>
			<persName><forename type="first">S</forename><surname>Gerani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Mehdad</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Carenini</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Ng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Nejat</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP)</title>
				<meeting>the 2014 conference on empirical methods in natural language processing (EMNLP)</meeting>
		<imprint>
			<date type="published" when="2014">2014</date>
			<biblScope unit="page" from="1602" to="1613" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Discourse coherence in the wild: A dataset, evaluation and methods</title>
		<author>
			<persName><forename type="first">A</forename><surname>Lai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Tetreault</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 19th Annual SIGdial Meeting on Discourse and Dialogue</title>
				<meeting>the 19th Annual SIGdial Meeting on Discourse and Dialogue</meeting>
		<imprint>
			<date type="published" when="2018">2018</date>
			<biblScope unit="page" from="214" to="223" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Unsupervised learning of discourse-aware text representation for essay scoring</title>
		<author>
			<persName><forename type="first">F</forename><forename type="middle">S</forename><surname>Mim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Inoue</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Reisert</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Ouchi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Inui</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop</title>
				<meeting>the 57th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop</meeting>
		<imprint>
			<date type="published" when="2019">2019</date>
			<biblScope unit="page" from="378" to="385" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Evaluating document coherence modeling</title>
		<author>
			<persName><forename type="first">A</forename><surname>Shen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Mistica</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Salehi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Baldwin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Qi</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Transactions of the Association for Computational Linguistics</title>
		<imprint>
			<biblScope unit="volume">9</biblScope>
			<biblScope unit="page" from="621" to="640" />
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Unraveling text coherence from the human perspective: a novel dataset for italian</title>
		<author>
			<persName><forename type="first">F</forename><surname>Papa</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Dini</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Brunato</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Dell'orletta</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Ninth Italian Conference on Computational Linguistics (CLiC-it 2023)</title>
				<meeting>the Ninth Italian Conference on Computational Linguistics (CLiC-it 2023)</meeting>
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Is incoherence surprising? targeted evaluation of coherence prediction from language models</title>
		<author>
			<persName><forename type="first">A</forename><surname>Beyer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Loáiciga</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Schlangen</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</title>
				<meeting>the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</meeting>
		<imprint>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="4164" to="4173" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Automatically evaluating text coherence using discourse relations</title>
		<author>
			<persName><forename type="first">Z</forename><surname>Lin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><forename type="middle">T</forename><surname>Ng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M.-Y</forename><surname>Kan</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies</title>
				<meeting>the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies</meeting>
		<imprint>
			<date type="published" when="2011">2011</date>
			<biblScope unit="page" from="997" to="1006" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">How coherent are neural models of coherence?</title>
		<author>
			<persName><forename type="first">L</forename><surname>Pishdad</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Fancellu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Fazly</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 28th International Conference on Computational Linguistics</title>
				<meeting>the 28th International Conference on Computational Linguistics</meeting>
		<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="6126" to="6138" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">A unified neural coherence model</title>
		<author>
			<persName><forename type="first">H</forename><forename type="middle">C</forename><surname>Moon</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">T</forename><surname>Mohiuddin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Joty</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Xu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)</title>
				<meeting>the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)</meeting>
		<imprint>
			<date type="published" when="2019">2019</date>
			<biblScope unit="page" from="2262" to="2272" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">Rethinking coherence modeling: Synthetic vs. downstream tasks</title>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">T</forename><surname>Mohiuddin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Jwalapuram</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Lin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Joty</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume</title>
				<meeting>the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume</meeting>
		<imprint>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="3528" to="3539" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">Modeling local coherence: An entity-based approach</title>
		<author>
			<persName><forename type="first">R</forename><surname>Barzilay</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Lapata</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Computational Linguistics</title>
		<imprint>
			<biblScope unit="volume">34</biblScope>
			<biblScope unit="page" from="1" to="34" />
			<date type="published" when="2008">2008</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">Can transformer models measure coherence in text: Re-thinking the shuffle test</title>
		<author>
			<persName><forename type="first">P</forename><surname>Laban</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Dai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Bandarkar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">A</forename><surname>Hearst</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing</title>
		<title level="s">Short Papers</title>
		<meeting>the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing</meeting>
		<imprint>
			<date type="published" when="2021">2021</date>
			<biblScope unit="volume">2</biblScope>
			<biblScope unit="page" from="1058" to="1064" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">Pretraining with contrastive sentence objectives improves discourse performance of language models</title>
		<author>
			<persName><forename type="first">D</forename><surname>Iter</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Guu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Lansing</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Jurafsky</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics</title>
				<meeting>the 58th Annual Meeting of the Association for Computational Linguistics</meeting>
		<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="4859" to="4870" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<monogr>
		<author>
			<persName><forename type="first">A</forename><surname>Maimon</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Tsarfaty</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2310.00598</idno>
		<title level="m">A novel computational and modeling foundation for automatic coherence assessment</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b20">
	<analytic>
		<title level="a" type="main">Towards understanding large-scale discourse structures in pre-trained and fine-tuned language models</title>
		<author>
			<persName><forename type="first">P</forename><surname>Huber</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Carenini</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</title>
				<meeting>the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</meeting>
		<imprint>
			<date type="published" when="2022">2022</date>
			<biblScope unit="page" from="2376" to="2394" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b21">
	<analytic>
		<title level="a" type="main">FFCD: A fast-and-frugal coherence detection method</title>
		<author>
			<persName><forename type="first">S</forename><surname>Duari</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Bhatnagar</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE Access</title>
		<imprint>
			<biblScope unit="volume">10</biblScope>
			<biblScope unit="page" from="85305" to="85314" />
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b22">
	<analytic>
		<title level="a" type="main">Discourse probing of pretrained language models</title>
		<author>
			<persName><forename type="first">F</forename><surname>Koto</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">H</forename><surname>Lau</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Baldwin</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</title>
				<meeting>the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</meeting>
		<imprint>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="3849" to="3864" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b23">
	<analytic>
		<title level="a" type="main">DisCoTex at EVALITA 2023: Overview of the assessing discourse coherence in Italian texts task</title>
		<author>
			<persName><forename type="first">D</forename><surname>Brunato</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Colla</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Dell'Orletta</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Dini</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">P</forename><surname>Radicioni</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">A</forename><surname>Ravelli</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Eighth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop</title>
				<meeting>the Eighth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop<address><addrLine>EVALITA; Parma, Italy</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b24">
	<analytic>
		<title level="a" type="main">EVALITA 2023: Overview of the 8th evaluation campaign of natural language processing and speech tools for Italian</title>
		<author>
			<persName><forename type="first">M</forename><surname>Lai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Menini</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Polignano</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Russo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Sprugnoli</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Venturi</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Eighth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2023)</title>
				<meeting>the Eighth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2023)<address><addrLine>Parma, Italy</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b25">
	<analytic>
		<title level="a" type="main">Language-agnostic BERT sentence embedding</title>
		<author>
			<persName><forename type="first">F</forename><surname>Feng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Cer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Arivazhagan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Wang</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2022.acl-long.62</idno>
		<ptr target="https://aclanthology.org/2022.acl-long.62" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics</title>
		<title level="s">Long Papers</title>
		<editor>
			<persName><forename type="first">S</forename><surname>Muresan</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">P</forename><surname>Nakov</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">A</forename><surname>Villavicencio</surname></persName>
		</editor>
		<meeting>the 60th Annual Meeting of the Association for Computational Linguistics<address><addrLine>Dublin, Ireland</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2022">2022</date>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="page" from="878" to="891" />
		</imprint>
	</monogr>
	<note>Association for Computational Linguistics</note>
</biblStruct>

<biblStruct xml:id="b26">
	<analytic>
		<title level="a" type="main">Multilingual universal sentence encoder for semantic retrieval</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Cer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Ahmad</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Guo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Law</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Constant</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Hernandez Abrego</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Yuan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Tar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y.-H</forename><surname>Sung</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Strope</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Kurzweil</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2020.acl-demos.12</idno>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, Association for Computational Linguistics</title>
				<editor>
			<persName><forename type="first">A</forename><surname>Celikyilmaz</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">T.-H</forename><surname>Wen</surname></persName>
		</editor>
		<meeting>the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, Association for Computational Linguistics</meeting>
		<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="87" to="94" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b27">
	<analytic>
		<title level="a" type="main">Unsupervised cross-lingual representation learning at scale</title>
		<author>
			<persName><forename type="first">A</forename><surname>Conneau</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Khandelwal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Goyal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Chaudhary</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Wenzek</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Guzmán</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Grave</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Ott</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Zettlemoyer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Stoyanov</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2020.acl-main.747</idno>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics</title>
				<editor>
			<persName><forename type="first">D</forename><surname>Jurafsky</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><surname>Chai</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">N</forename><surname>Schluter</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><surname>Tetreault</surname></persName>
		</editor>
		<meeting>the 58th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics</meeting>
		<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="8440" to="8451" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b28">
	<monogr>
		<author>
			<persName><forename type="first">G</forename><surname>Sarti</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Nissim</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2203.03759</idno>
		<title level="m">IT5: Large-scale text-to-text pretraining for Italian language understanding and generation</title>
				<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b29">
	<analytic>
		<title level="a" type="main">As good as new. How to successfully recycle English GPT-2 to make models for other languages</title>
		<author>
			<persName><forename type="first">W</forename><surname>De Vries</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Nissim</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2021.findings-acl.74</idno>
	</analytic>
	<monogr>
		<title level="m">Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, Association for Computational Linguistics</title>
				<editor>
			<persName><forename type="first">C</forename><surname>Zong</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">F</forename><surname>Xia</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">W</forename><surname>Li</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">R</forename><surname>Navigli</surname></persName>
		</editor>
		<imprint>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="836" to="846" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b30">
	<monogr>
		<author>
			<persName><forename type="first">L</forename><surname>De Mattei</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Cafagna</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Dell'Orletta</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Nissim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Guerini</surname></persName>
		</author>
		<idno type="DOI">10.48550/arXiv.2004.14253</idno>
		<idno type="arXiv">arXiv:2004.14253</idno>
		<title level="m">GePpeTto carves Italian into a language model</title>
				<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b31">
	<analytic>
		<title level="a" type="main">Sentence-BERT: Sentence embeddings using Siamese BERT-networks</title>
		<author>
			<persName><forename type="first">N</forename><surname>Reimers</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Gurevych</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)</title>
				<meeting>the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)</meeting>
		<imprint>
			<date type="published" when="2019">2019</date>
			<biblScope unit="page" from="3982" to="3992" />
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
