Coherence Evaluation in Italian Language Models

Marta Sartor1, Felice Dell'Orletta1 and Giulia Venturi1
1 Istituto di Linguistica Computazionale "A. Zampolli" (ILC-CNR), ItaliaNLP Lab, via G. Moruzzi 1, Pisa, Italy

Abstract
Coherence assessment is central to many NLP tasks, but its evaluation is complex and often done indirectly. In the LLM era, it is even more crucial to understand how, and how well, these models represent coherence. This study investigates the effectiveness of small Italian language models (under 1B parameters) in assessing coherence and focuses on the factors that most influence their performance. Our analysis involves 15 Transformer-based LLMs differing in architecture, parameter size, and training data, and monitors the different textual genres and perturbations used during dataset construction. Two coherence modeling strategies are tested: perplexity and inter-sentence semantic distance. We show that best practices vary significantly depending on model architecture and approach, but most importantly on the kind of texts they are applied to, highlighting the nuanced interaction between textual genre, data perturbation, and model performance.

Keywords
Italian LM, coherence assessment, perplexity, inter-sentence semantic distance

1. Introduction

Coherence is the meaning connection that binds the components of a text [2] and is fundamental to ensuring the effectiveness of every communicative act. Consequently, in computational linguistics, its analysis is crucial for the resolution of numerous tasks, from identifying the necessary information for question answering [3] to recognizing pathological speech [4, 5], from automatic readability assessment [6] to automatic summary generation [7]. Its critical importance has led to the development of a number of resources (e.g. [8, 9, 10]); for Italian specifically, a new dataset annotated with human judgments of coherence has recently been released (DisCoTex, [11]).
It is however notably complex to model coherence computationally, as it does not require explicit linguistic structures to be expressed: it is rather a psychological construct [12], reconstructed implicitly through inferences, general knowledge, co-text, and context [2]. Moreover, its highly subjective nature [11] also makes coherence difficult to assess and evaluate: the soundest and most direct approach would be to employ human evaluations, but such data is very costly and time-consuming to collect. For this reason, the most common coherence evaluation strategies are by proxy, primarily through the order discrimination task. Its underlying assumption, that shuffled texts are less coherent than the originals, though sound [13], has shown its limits from the outset [14, 12, 15, 16]. However, its efficiency in terms of resources has encouraged several variations on the original task [17], ranging from altering the number and position of the shuffled sentences [15, 18] to replacing shuffling with substitutions from a closely related document [10].

Since the introduction of Transformer models and the paradigm shift brought about by Large Language Models, coherence modeling approaches have moved in that direction. Many works employ LMs, either developing specific models through specialized training [19, 20] or leveraging pretrained models through new indirect approaches [21, 22]. A great deal of attention has since also been devoted to probing these models to more accurately evaluate their ability at coherence assessment [10, 23, 18].

Our contribution. We test the ability of small (under 1B parameters) Italian LLMs to model coherence, evaluating them against human judgments with two modeling strategies: perplexity and inter-sentence semantic distance.
We were interested in examining how heavily different factors impact model performance, and we directed our analysis at three main aspects:

• LLM characteristics;
• textual genre of the target text;
• textual perturbation applied during dataset construction.

The first point has a widely known impact, and we attempted, as far as the available models allowed, to systematically monitor several components. To this end we selected 15 Transformer-based LLMs, all under 1 billion parameters, differing in architecture, parameter size, target language, and/or training data size. The literature is also quite clear on the impact that both textual genre and data perturbation can have, both in the training and the evaluation phase. Nonetheless, it is not always easy to monitor these factors, especially due to resource availability. In order to take a deeper look at both these aspects, we chose to work on the DisCoTex [11] dataset, which contains small paragraphs from two different genres (TEDx and Wikipedia) and with different degrees of perturbation at the inter-sentential level (none, inversion, substitution), where each instance is annotated with human judgments of coherence.

NL4AI 2024: Eighth Workshop on Natural Language for Artificial Intelligence, November 26-27th, 2024, Bolzano, Italy [1]
marta.sartor@ilc.cnr.it (M. Sartor); felice.dellorletta@ilc.cnr.it (F. Dell'Orletta); giulia.venturi@ilc.cnr.it (G. Venturi)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073

2. Methodology

We tested two different unsupervised approaches to modeling coherence: inter-sentence semantic distance and perplexity. The first is a widely used technique to compare vector representations of meaning and was calculated on all models tested, regardless of architecture.
The paragraph is first divided into sentences using the sentence-splitting feature of the Stanza tokenizer, version 1.5.0 (https://pypi.org/project/stanza/1.5.0/#files). Each sentence is tokenized and processed by the model to obtain a single vector representation of the sentence; inter-sentence distance is then calculated between all pairs of consecutive sentences, and a global paragraph value is obtained through a statistical function. Sentence embeddings were computed differently depending on the model's architecture: for sentence encoders, the direct output of the model was taken; for decoders, the last-layer representations of each token in the sentence were mean-pooled into a single vector; for encoder-decoders and encoders, the same process was applied to the encoder's last layer, and for encoder models the CLS token was also tested as a possible sentence embedding. Since this is a less straightforward methodology, at each step we tested several variants to broaden the analysis as much as possible:

• distance was calculated both as cosine and as Euclidean distance;
• for encoders, sentence embeddings were represented both through the CLS token and through mean-pooling of the sentence tokens;
• paragraph values were pooled from inter-sentence values through different statistical functions, namely mean and standard deviation.

Our choice of statistical function fell firstly on the mean, a global measure of semantic distance widely used and recognized in the literature. We added standard deviation, a less commonly used function, because of the nature of the dataset, where the data is locally perturbed: we felt that a more local measure of distance might be suited to model such an alteration.

We also tried modeling coherence in terms of textual plausibility by using perplexity: it is a global indicator and has already been successfully employed to this end [12, 18].
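The inter-sentence distance pipeline described earlier (sentence embeddings → distances between consecutive sentences → paragraph-level pooling) can be sketched as follows. This is a minimal illustration only: embeddings are plain Python lists, and in a real run they would come from mean-pooling a model's last-layer token vectors; all function names are ours, not the paper's.

```python
import math

def mean_pool(token_vectors):
    """Average token vectors into a single sentence embedding."""
    n = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(v[d] for v in token_vectors) / n for d in range(dim)]

def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm

def euclidean_distance(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def paragraph_score(sentence_embeddings, distance=cosine_distance, pool="mean"):
    """Distance between all consecutive sentence pairs, pooled over the paragraph."""
    pairs = zip(sentence_embeddings, sentence_embeddings[1:])
    dists = [distance(a, b) for a, b in pairs]
    mean = sum(dists) / len(dists)
    if pool == "mean":
        return mean
    # "std": population standard deviation, a local measure of distance variability
    return math.sqrt(sum((d - mean) ** 2 for d in dists) / len(dists))
```

For instance, a paragraph whose consecutive sentence embeddings are pairwise orthogonal unit vectors yields a mean cosine distance of 1.0, i.e. maximal semantic distance under this measure.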
This method was applied to all decoders and also to two selected encoder models, which allowed a direct comparison with respect to the target-language strategy. On decoder models we calculated perplexity, while on encoder models we used a plausibility metric, so as to be able to compare it with perplexity. Plausibility was calculated through masked language modeling by masking each token of the paragraph one by one and averaging their likelihood across the paragraph, as reported below, with n being the total number of tokens:

plausibility(X) = (1/n) · Σ_{i=1}^{n} P(x_i)

In order to address the impact of textual genre and text perturbation, the analysis of the results was carried out at various levels of granularity: on the entire dataset, by source, and by perturbation within each source. Each model was evaluated through the Spearman correlation of its results with human judgments. Additionally, the difference in distribution between classes (source of texts or type of perturbation) was assessed using the Wilcoxon T-test and rank-biserial correlation. Evaluating performance through correlation with human judgment, besides being a more straightforward and reliable approach, also allows us to effectively counterbalance the possible bias introduced by the fact that Wikipedia is present both in the pretraining data of most models and in our evaluation dataset.

Baselines were set as random values. Perplexity and plausibility were both assimilated to a probability distribution, so we generated random values between 0 and 1. For inter-sentence distance, the chosen distance measure was calculated between as many random values as the average length in sentences of the dataset items, which is 4; the range in which we generated each value depended on the measure: -1 to 1 for cosine distance, and 0 to 1 for Euclidean distance (as if the values had been normalized, since its maximum is unbounded).
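The mask-and-average procedure behind the plausibility score can be sketched as follows. This is only an illustration of the formula above: the `prob_fn` callable stands in for a real masked LM (e.g. an Italian BERT scoring the original token at the masked position), and its signature is our assumption, not the paper's implementation.

```python
def plausibility(paragraph_tokens, prob_fn):
    """Mask each token in turn and average the model's probability for the
    original token: plausibility(X) = (1/n) * sum_i P(x_i)."""
    n = len(paragraph_tokens)
    total = 0.0
    for i, token in enumerate(paragraph_tokens):
        # Replace position i with a mask; prob_fn is a stand-in for the
        # masked LM returning P(token | masked context).
        masked = paragraph_tokens[:i] + ["[MASK]"] + paragraph_tokens[i + 1:]
        total += prob_fn(masked, i, token)
    return total / n
```

Since the result is an average probability, higher values indicate more plausible (more coherent) text, the opposite direction of perplexity; this is why the paper inverts the sign of the resulting correlations before comparing the two.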
2.1. Dataset

In this work, we used the dataset released by [11], recently integrated into a larger benchmark released for DisCoTex [24], a shared task on textual coherence analysis presented at the 8th evaluation campaign of NLP and speech tools for the Italian language (EVALITA 2023) [25]. The dataset consists of 1064 instances, each corresponding to a paragraph of 4-5 sentences annotated with human coherence judgments. The data is sourced either from the Italian Wikipedia or from the Italian section of the Multilingual TEDx dataset (which contains TEDx transcripts), to represent different linguistic varieties; the instances are balanced by source. During dataset construction, about two thirds of the instances were subjected to alterations that more or less significantly damaged the internal coherence of the paragraph, to test the effect of some common text perturbation strategies on human judgment. The alterations were either the inversion of two sentences within the paragraph (inversion perturbation) or the replacement of a sentence in the paragraph with the tenth sentence from the end of the paragraph (substitution perturbation). The remaining third of the instances was left unaltered, to serve as a control group. Each paragraph is annotated with human judgment values, corresponding to the mean and standard deviation of the ratings collected for each instance from at least 10 human annotators. The judgments were collected through crowdsourcing from native Italian speakers and are expressed on a Likert scale from 1 to 5, 1 being the lowest coherence score and 5 the highest. It must be noted, however, that source and perturbation type differentiate texts significantly in terms of the distribution of the human coherence judgments they receive (see figures 1 and 2), to the point that the difference between distributions remains statistically significant even when differentiating instances by both source and perturbation type (see figure 3).
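Treating a paragraph as a list of sentences, the two perturbation strategies used in the dataset can be sketched as follows. This is our illustration, not the dataset's construction code: `replacement` stands in for the out-of-paragraph sentence used by the substitution perturbation, and the RNG seeding is an assumption for reproducibility.

```python
import random

def inversion(sentences, rng=None):
    """Inversion perturbation: swap two distinct sentences of the paragraph."""
    rng = rng or random.Random(0)
    out = list(sentences)
    i, j = rng.sample(range(len(out)), 2)  # two distinct positions
    out[i], out[j] = out[j], out[i]
    return out

def substitution(sentences, position, replacement):
    """Substitution perturbation: replace the sentence at `position`
    with one taken from outside the paragraph."""
    out = list(sentences)
    out[position] = replacement
    return out
```

Both functions leave the original paragraph untouched and return a new list, so unaltered control instances and their perturbed counterparts can coexist.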
Indeed, the difference increases with the heaviness of the perturbation applied to the instance, but the source has a much stronger impact than the perturbation. It is also worth noting that the value range in which most Wikipedia texts are located is far less sparse than that occupied by texts sourced from TEDx. For this work, the dataset was integrated with additional data not available in the released version, namely all source and perturbation labels.

Figure 1: Mean coherence value distribution on the basis of textual genre.
Figure 2: Mean coherence value distribution on the basis of perturbation type.
Figure 3: Mean coherence value distribution on the basis of genre and perturbation type.

2.2. Models

We tested 15 different Transformer-based models, covering the most common architectures (encoder, decoder, encoder-decoder, sentence encoder). Most models are BERT-based, to allow better comparability on some of the variations we wanted to account for: we used a multilingual version (mBERT) and two monolingual Italian versions (BERT-ita and BERT-ita-xxl) differentiated by dataset size, as well as a sentence encoder (sBERT). Different model sizes were not tested because they were not available for the Italian versions of these models; uncased versions were not tested due to the nature of the Italian language and, most importantly, the developers' own recommendations, which indicated that the cased version was better.

Table 1
Descriptive table of the models tested, highlighting the dimensions that we aim to compare. n.d. means that the training data size was not declared.

model            | architecture     | parameter size | training data (size) | language
LABSE [26]       | sentence encoder | 470M   | n.d.   | multilingual
MUSE [27]        | sentence encoder | 69M    | n.d.   | multilingual
MUSE large^2     | sentence encoder | 85M    | n.d.   | multilingual
sBERT^3          | sentence encoder | 111M   | n.d.   | italian
mBERT^4          | encoder          | 179M   | n.d.   | multilingual
BERT-ita^5       | encoder          | 111M   | 13GB   | italian
BERT-ita-xxl^6   | encoder          | 111M   | 81GB   | italian
XLM-R base [28]  | encoder          | 250M   | 2.5TB  | multilingual
XLM-R large [28] | encoder          | 560M   | 2.5TB  | multilingual
IT5 small [29]   | encoder-decoder  | 60.5M  | 215GB  | italian
IT5 base [29]    | encoder-decoder  | 223M   | 215GB  | italian
IT5 large [29]   | encoder-decoder  | 770M   | 215GB  | italian
GroGPT [30]      | decoder          | 117M   | 13.8GB | italian
GePpeTto [31]    | decoder          | 117M   | 13.8GB | italian
Minerva^7        | decoder          | 350M   | n.d.   | italian

Table 1 summarizes the models' characteristics with respect to our research questions. With "language" we refer here to the target language of the models, which does not always coincide with the language of the training data: for example, GroGPT is developed for Italian but is an English GPT-2 model whose lexical embeddings have been retrained for Italian.

3. Experimental Results

As previously stated, the coherence judgments expressed by annotators are on a Likert scale from 1 (not very coherent) to 5 (very coherent). Cosine and Euclidean distances, as well as perplexity, conversely express greater coherence when the score is lower, so a negative correlation is expected.
Plausibility, on the other hand, being an average probability, behaves like the Likert scale, with low scores for incoherent texts and high scores for coherent ones; thus, the direction of its correlation is opposite to those of all other measures. In order to compare plausibility and perplexity, the sign of the correlation with plausibility has been inverted; we will henceforth refer to it as pseudo-perplexity.

^2 https://tfhub.dev/google/universal-sentence-encoder-multilingual-large/3
^3 https://huggingface.co/nickprock/sentence-bert-base-italian-xxl-uncased
^4 https://github.com/google-research/bert/blob/master/multilingual.md
^5 https://huggingface.co/dbmdz/bert-base-italian-cased
^6 https://huggingface.co/dbmdz/bert-base-italian-xxl-cased
^7 https://huggingface.co/sapienzanlp/Minerva-350M-base-v1.0

Table 2
Spearman correlation of human judgment labels with model predictions, on the entire dataset. Except for sentence encoders, the sentence pooling strategy is mean-pooling unless stated otherwise. The asterisk indicates p-value < 0.05.

model            | mean eucl ↓ | mean cos ↓ | std eucl ↓ | std cos ↓
baseline         | 0.01   | -0.07  | 0.03   | -0.03
LABSE            | *-0.43 | *-0.43 | *0.06  | 0.01
MUSE             | *-0.37 | *-0.38 | *0.09  | 0.05
MUSE large       | *-0.38 | *-0.39 | *0.07  | 0.03
sBERT            | *-0.42 | *-0.45 | 0.01   | -0.02
mBERT CLS        | *-0.14 | *-0.14 | *-0.06 | *-0.09
mBERT            | *-0.29 | *-0.34 | -0.04  | *-0.10
BERT-ita CLS     | -0.02  | *-0.14 | -0.03  | *-0.12
BERT-ita-xxl CLS | *-0.20 | *-0.24 | *-0.09 | *-0.15
BERT-ita         | *-0.28 | *-0.32 | 0.03   | -0.05
BERT-ita-xxl     | *-0.38 | *-0.39 | *-0.13 | *-0.13
XLM-R base       | *-0.32 | *-0.32 | *-0.19 | *-0.23
XLM-R large      | *-0.34 | *-0.33 | *-0.18 | *-0.22
IT5 small        | *-0.38 | *-0.37 | *-0.17 | *-0.19
IT5 base         | *-0.36 | *-0.36 | *-0.20 | *-0.23
IT5 large        | *-0.36 | *-0.37 | *-0.19 | *-0.14
GroGPT           | *-0.13 | *-0.18 | *-0.10 | *-0.13
GePpeTto         | *-0.20 | *-0.34 | -0.05  | *-0.10
Minerva          | *-0.35 | 0.00   | *-0.28 | *-0.10

The analysis of results was performed on three different levels: on the entire dataset (sect. 3.1) and, to account for genre and perturbation differences, separating instances by their source (sect.
3.2) and, for each source, by their perturbation type (sect. 3.3).

3.1. Overall Analysis

The Spearman correlations of the models' predictions with human judgments of coherence are shown in tables 2 and 3, which cover approaches using inter-sentence distance and (pseudo)perplexity respectively. Coefficients marked with an asterisk are statistically significant.

Results for each tested methodology and model are always above the baseline, except for the inter-sentence standard deviation of some models. Globally, the strongest correlation with human judgment is obtained by perplexity with Minerva (-0.46) and mBERT (-0.45), and by the average inter-sentence cosine distance calculated with sBERT (-0.45). Among the best are also the average cosine (-0.43) and Euclidean (-0.43) distances with LABSE, and the average Euclidean distance with sBERT (-0.42).

Comparing the different pooling functions for inter-sentence distance, standard deviation appears to be an unreliable approach: its results often lacked correlation with human judgment or had very low coefficients compared to mean inter-sentence distance, which is also (almost) always statistically significant. Across all models, standard deviation averages -0.08 and -0.11 correlation for Euclidean and cosine distance respectively, while mean inter-sentence distance averages, respectively, -0.30 and -0.31. These results also highlight a different trend: Euclidean distance generally obtains weaker correlations than cosine distance. The difference is even more pronounced (-0.29 vs -0.32, -0.07 vs -0.11) when leaving out the one notable exception, Minerva, which despite strong results with Euclidean distance has zero or non-significant correlation with cosine distance. The best approach however remains perplexity or pseudo-perplexity (-0.38 on average), much higher than what the same models averaged when using mean inter-sentence distance (Euclidean: -0.27, cosine: -0.25).
As for sentence embeddings, those obtained with CLS are significantly worse, not only with respect to the corresponding mean-pooled embeddings (as already suggested by the literature: see e.g. [32]) but also with respect to every other model.

Table 3
Spearman correlation between (pseudo)perplexity and human judgment labels, on the entire dataset. The asterisk indicates p-value < 0.05.

model        | (P)PPL ↓
baseline     | -0.06
mBERT        | *-0.45
BERT-ita-xxl | *-0.35
GroGPT       | *-0.39
GePpeTto     | *-0.27
Minerva      | *-0.46

Among the different architectures, sentence encoders obtain the best overall results: they achieve the highest correlation scores, with great consistency among the different models; their average mean inter-sentence distance coefficient is -0.40 for Euclidean distance and -0.41 for cosine distance, both higher than the average perplexity score. Their predictions have consistently lower correlation than other models (or no correlation at all) only when observing the standard deviation of inter-sentence distance. IT5, on the other hand, which represents encoder-decoders, correlates well with human judgment both through the mean and through the standard deviation of inter-sentence distance, the latter being statistically significant and the highest among all models, though still rather low. The different IT5 versions have, with mean inter-sentence distance, lower correlation than sentence encoders, but still perform on par with the best encoder models.

Overall, the correlation of encoder results with human judgment appears very variable. However, the high variability is due to the much lower performance obtained when using CLS, instead of mean-pooling, as the sentence representation: this strategy results in little to no significant correlation with human judgments. Considering only encoders with the mean-pooling strategy, the average correlation with human judgment is fairly high, although inferior to that of sentence encoders and IT5.
Decoders show the highest variability across individual models: Minerva obtains the best results, but only using Euclidean distance, while its cosine-based results are non-significant; GePpeTto has the lowest perplexity scores but performs well with mean inter-sentence distance; lastly, GroGPT has a low but consistent performance with both the mean and the standard deviation of inter-sentence distance, and good perplexity scores.

As for parameter size, it does not seem to influence results, either when comparing different sizes of the same model (where the comparison is possible) or when considering absolute parameter size: only with sentence encoders does correlation increase in parallel with parameter size, and the difference is rather small. Training dataset size, on the other hand, consistently changes performance from BERT-ita to BERT-ita-xxl, especially when using CLS. It is instead unclear how, or whether, target language influences performance: multilingual encoders perform between BERT-ita and BERT-ita-xxl when considering inter-sentence distances, hinting at a possible advantage, and with (pseudo)perplexity multilingual BERT has a correlation score almost on par with that of the much bigger Minerva. In sentence encoders, however, the opposite seems to be true: sBERT performs comparably to the best multilingual sentence encoder, LABSE, despite a considerable difference in parameter size.

3.2. Analysis by source

Tables 4 and 5 show the correlation coefficients of model predictions with human judgments when dividing the dataset by source. Confirming the results of [24], performance on the TEDx and Wikipedia sections is very different, with the former obtaining higher coefficients with all coherence assessment approaches (with the only exception of the mean inter-sentence cosine distance calculated with IT5 base).
On Wikipedia, the correlation with human judgments is more often non-significant, and its coefficient is stronger than -0.2 only with sentence encoders or through (pseudo)perplexity scores: excluding the standard deviation of inter-sentence distance, the average Wikipedia coefficient is -0.12, while for TEDx it is -0.24. This could be influenced by the fact that human coherence judgments on Wikipedia texts are more densely distributed in the upper (high-coherence) range, as exemplified in figure 1; it is also worth noting that higher coherence values for Wikipedia texts were produced by almost all models regardless of approach, although the magnitude of this difference varied considerably between models.

In line with what was observed on the entire dataset, sentence encoders remain the best performing architecture and (pseudo)perplexity the most effective coherence assessment approach. The unsuitability of the standard deviation of inter-sentence distance and of CLS as a sentence embedding is also further confirmed, with cosine distance still obtaining better results than Euclidean distance. Minerva also maintains its skewed results between approaches using Euclidean and cosine distances.

The combination that achieves the highest correlation with human judgment is perplexity calculated with Minerva (-0.46 and -0.26 on TEDx and Wikipedia respectively), followed closely by pseudo-perplexity calculated with BERT-ita-xxl (-0.43, -0.23); the average inter-sentence cosine distance calculated with sBERT (-0.38, -0.24) also obtains satisfying results. As already observed, the standard deviation of inter-sentence distance is not a reliable coherence indicator: its correlation with human judgment is hardly ever significant and, when it is, it is significant on only one of the two classes and with very low coefficients.
Perplexity and pseudo-perplexity, on the other hand, with the sole exception of GePpeTto, obtain much higher correlation on TEDx than any other approach and keep consistently high coefficients on Wikipedia, where most other approaches falter. Besides GePpeTto, perplexity results vary significantly between models on TEDx (from -0.32 to -0.46) but are largely similar (from -0.23 to -0.26) on Wikipedia; moreover, the best perplexity results gain a noticeable margin (0.08) over those of inter-sentence distances on TEDx, while on Wikipedia they improve by only 0.02. This reduced improvement brought about by perplexity, together with the overall lower performance on the Wikipedia section, supports our claim that Wikipedia's presence in most training datasets is offset by using human coherence judgments, rather than perturbation labels, for the evaluation.

Sentence encoders remain the best performing architecture, not only for their higher correlation coefficients on the TEDx section but also, and especially, for their performance on the Wikipedia section, which is always significant and higher on average than that of any other architecture. As was the case on the overall dataset, their average score (-0.35 for TEDx and -0.19 for Wikipedia) is close to the average perplexity (-0.36 and -0.21 respectively), although this time slightly lower. Sentence encoders also appear in general much less sensitive than other architectures to the type of distance used, except for sBERT. Encoders maintain a certain variability depending on the model and are still comparable to decoders when using perplexity, but this time, with inter-sentence distances, they generally perform better than IT5, for which the Wikipedia class always shows low to no correlation with human judgment.
Decoders, on the other hand, have consistently low performance on inter-sentence distance approaches, in contrast with what was previously observed; the sole exception is GePpeTto when leveraging mean cosine distance. Once more GePpeTto's ranking reverses: weakest with perplexity, it obtains the best inter-sentence distance scores among decoders. As for parameter and training data size, no significant differences were observed. The impact of the target language remains, however, unclear: overall there seems to be no clear preference for either, and the direct comparison between mBERT and BERT-ita/BERT-ita-xxl shows the former outperforming the latter on inter-sentence distances and the opposite on pseudo-perplexity.

3.3. Analysis by source and perturbation

As we already observed, the huge differences between TEDx and Wikipedia (both in terms of human judgment and of model behavior) are such that the impact of different kinds of perturbation can only be observed by keeping the two genres separate. The impact of this difference is also clearly visible at this level of analysis: correlation results differ strongly between the TEDx and Wikipedia sections, both in terms of significance and of class distribution. Not only do the TEDx section results show higher correlation coefficients, but in the Wikipedia section most correlations are not significant. Furthermore, the perturbation class with the highest correlation with human judgment for TEDx, namely the inversion class, is never significant in the Wikipedia section except when using (pseudo)perplexity.

Table 4
Spearman correlation of human judgment labels with model predictions, dividing the dataset by source. Except for sentence encoders, the sentence pooling strategy is mean-pooling unless stated otherwise. The asterisk indicates p-value < 0.05.

                      |      MEAN ↓     |      STD ↓
model                 | TED    | WIKI   | TED    | WIKI
LABSE eucl            | *-0.36 | *-0.21 | 0.05   | 0.07
LABSE cos             | *-0.37 | *-0.20 | 0.02   | 0.04
MUSE eucl             | *-0.33 | *-0.18 | *0.14  | 0.05
MUSE cos              | *-0.33 | *-0.17 | *0.12  | 0.02
MUSE large eucl       | *-0.37 | *-0.18 | *0.14  | 0.04
MUSE large cos        | *-0.37 | *-0.18 | *0.11  | 0.02
sBERT eucl            | *-0.32 | *-0.19 | 0.06   | 0.07
sBERT cos             | *-0.38 | *-0.24 | 0.03   | 0.04
mBERT CLS eucl        | *-0.14 | *-0.10 | 0.01   | -0.03
mBERT CLS cos         | *-0.13 | *-0.12 | -0.03  | -0.06
mBERT eucl            | *-0.24 | *-0.10 | -0.05  | 0.01
mBERT cos             | *-0.28 | *-0.16 | -0.08  | -0.03
BERT-ita CLS eucl     | -0.06  | -0.01  | -0.06  | -0.02
BERT-ita CLS cos      | *-0.13 | -0.06  | *-0.11 | -0.07
BERT-ita-xxl CLS eucl | *-0.09 | -0.08  | 0.04   | -0.05
BERT-ita-xxl CLS cos  | *-0.11 | -0.08  | 0.01   | -0.08
BERT-ita eucl         | *-0.17 | *-0.09 | 0.02   | 0.05
BERT-ita cos          | *-0.23 | *-0.11 | 0.00   | 0.00
BERT-ita-xxl eucl     | *-0.25 | -0.08  | -0.03  | -0.04
BERT-ita-xxl cos      | *-0.26 | *-0.12 | -0.04  | -0.04
XLM-R base eucl       | *-0.19 | *-0.11 | -0.08  | -0.06
XLM-R base cos        | *-0.18 | *-0.12 | *-0.10 | -0.07
XLM-R large eucl      | *-0.23 | *-0.10 | *-0.13 | -0.02
XLM-R large cos       | *-0.22 | *-0.10 | *-0.16 | -0.04
IT5 small eucl        | *-0.28 | -0.07  | -0.08  | -0.06
IT5 small cos         | *-0.25 | *-0.09 | -0.07  | -0.06
IT5 base eucl         | *-0.21 | *-0.09 | -0.08  | -0.08
IT5 base cos          | *-0.23 | *-0.11 | -0.07  | *-0.12
IT5 large eucl        | *-0.23 | -0.06  | *-0.09 | -0.04
IT5 large cos         | *-0.25 | *-0.10 | 0.00   | -0.07
GroGPT eucl           | -0.07  | -0.01  | -0.07  | 0.02
GroGPT cos            | -0.07  | -0.05  | -0.02  | -0.03
GePpeTto eucl         | *-0.20 | -0.05  | -0.08  | 0.00
GePpeTto cos          | *-0.24 | *-0.16 | *-0.12 | -0.03
Minerva eucl          | *-0.20 | -0.06  | *-0.16 | -0.06
Minerva cos           | 0.00   | -0.03  | -0.05  | -0.04
It is worth noting not only that the perturbation classes rank differently in terms of performance between TEDx and Wikipedia (and, for Wikipedia, between inter-sentence distance-based methods and perplexity), but also that these differences are not aligned with inter-annotator agreement. The only factors common to the two sections are the effectiveness of pseudo-perplexity and sentence encoders and the ineffectiveness of the standard deviation of inter-sentence distance. Due to these significant differences, the two sections are treated separately.

Table 5
Spearman correlation between (pseudo)perplexity and human judgment labels, by source. The asterisk indicates p-value < 0.05.

model        | (P)PPL ↓ TED | (P)PPL ↓ WIKI
mBERT        | *-0.32 | *-0.25
BERT-ita-xxl | *-0.43 | *-0.23
GroGPT       | *-0.34 | *-0.24
GePpeTto     | *-0.25 | -0.09
Minerva      | *-0.46 | *-0.26

3.3.1. TEDx

The correlation of models' predictions with human judgment in the TEDx section is shown in tables 6 and 7, for inter-sentence distance approaches and (pseudo)perplexity respectively. Among the perturbation classes, the inversion class generally has the strongest correlation with human judgment, generally followed by the unaltered class and then by the substitution class. Correlation is generally statistically significant, except for the standard deviation of inter-sentence distance measures, which, as we already observed, is an unreliable approach. The only other cases of non-significant coefficients occur with mean inter-sentence distance, mainly in the substitution class and in a few cases in the unaltered class, mostly when using CLS as the sentence embedding. The highest correlation with human judgment was obtained by Minerva's perplexity (-0.45 unaltered, -0.51 inversion, -0.39 substitution), followed by BERT-ita-xxl's pseudo-perplexity (-0.45, -0.45, -0.33) and LABSE's mean inter-sentence cosine distance (-0.35, -0.42, -0.25).
As always, perplexity was the approach with the highest correlation with human judgment, although with considerable variability across models: it averaged -0.35 for the unaltered class, -0.39 for the inversion class, and -0.30 for the substitution class. Also in line with previous observations, sentence encoders remain the best performing architecture, obtaining consistently high results and averaging -0.34, -0.36, and -0.23 for the unaltered, inversion, and substitution classes respectively, not far from the perplexity scores. The role of the model language (multilingual or Italian) remains, however, unclear, following the same patterns observed when dividing the dataset by source. Generally speaking, this level of analysis is mostly consistent with the previous ones. Some differences concern the role of parameter and training data size. While parameter size still does not seem relevant in absolute terms, here it has a positive impact when considering the different sizes of MUSE and XLM-R (although it seems almost counterproductive for IT5). Similarly, training data size improves performance from BERT-ita to BERT-ita-xxl (especially with mean-pooling) but does not seem to influence the other models. The difference between Euclidean and cosine distances is also reduced.

3.3.2. Wikipedia

Tables 8 and 9 show the correlation between the models' predictions and the human coherence judgments. The most interesting results concern the perturbation classes: the performance ranking of the different classes is not only different from that of TEDx, but also differs between approaches using inter-sentence distance and those using (pseudo)perplexity.
In the first case, the substitution class has the highest correlation with human judgment, while the inversion class performs the worst, never reaching statistical significance; when using (pseudo)perplexity, on the other hand, the inversion class correlates the most with human judgment, followed by the substitution class. Upon closer inspection, on the substitution and unaltered classes there is little difference between the average performance using (pseudo)perplexity (-0.15 and -0.22 respectively) and mean inter-sentence distance (-0.14 and -0.17, excluding outliers like CLS embeddings and cosine Minerva), especially when considering cosine distance (-0.15 and -0.20). What changes radically is performance on the inversion class, going from no correlation to -0.25.

Table 6
Spearman correlation between human judgments and unsupervised methodologies tested on pretrained models. Results on the TEDx section of the dataset, grouped by perturbation type. Except for sentence encoders, the sentence pooling strategy is mean-pooling unless stated otherwise. The asterisk indicates p-value < 0,05.
                            MEAN ↓                       STD ↓
                        no       swap     sub        no       swap     sub
LABSE eucl             *-0,34   *-0,42   *-0,25      0,02     0,07     0,08
LABSE cos              *-0,35   *-0,42   *-0,25     -0,03     0,03     0,05
MUSE eucl              *-0,31   *-0,35   *-0,20      0,07    *0,20     0,07
MUSE cos               *-0,32   *-0,35   *-0,20      0,04    *0,18     0,06
MUSE large eucl        *-0,39   *-0,34   *-0,25      0,09     0,13     0,13
MUSE large cos         *-0,40   *-0,34   *-0,26      0,06     0,11     0,12
sBERT eucl             *-0,30   *-0,30   *-0,20     -0,08     0,05    *0,20
sBERT cos              *-0,34   *-0,34   *-0,25     -0,06    -0,03    *0,20
mBERT CLS eucl          -0,07    -0,14   *-0,21      0,06     0,01    -0,11
mBERT CLS cos           -0,05    -0,14   *-0,20      0,05    -0,04    -0,14
mBERT eucl             *-0,23   *-0,20   *-0,25     -0,13     0,02    -0,08
mBERT cos              *-0,23   *-0,30   *-0,25     -0,10    -0,05    -0,09
BERT-ita CLS eucl       0,00    -0,08    -0,05      -0,05     0,02    *-0,22
BERT-ita CLS cos       -0,10    *-0,19   -0,06      -0,09    -0,09    *-0,18
BERT-ita-xxl CLS eucl   0,00    *-0,22   -0,04      -0,02     0,05     0,07
BERT-ita-xxl CLS cos   -0,01    *-0,24   -0,05      -0,05     0,00     0,05
BERT-ita eucl          *-0,19   *-0,16   *-0,16     -0,06     0,04     0,03
BERT-ita cos           *-0,24   *-0,25   *-0,17     -0,10    -0,01     0,03
BERT-ita-xxl eucl      *-0,26   *-0,32    -0,14     -0,12    -0,01     0,05
BERT-ita-xxl cos       *-0,24   *-0,34    -0,14     -0,09    -0,04     0,03
XLM-R base eucl        *-0,19   *-0,29    -0,05     *-0,17   -0,11     0,03
XLM-R base cos         *-0,19   *-0,27    -0,04     *-0,17   *-0,15    0,01
XLM-R large eucl       *-0,22   *-0,32    -0,13     *-0,25   *-0,16   -0,03
XLM-R large cos        *-0,23   *-0,31    -0,11     *-0,27   *-0,20   -0,05
IT5 small eucl         *-0,27   *-0,36   *-0,21     *-0,15   -0,06    -0,01
IT5 small cos          *-0,26   *-0,32    -0,15     *-0,17   -0,04    -0,01
IT5 base eucl           -0,14   *-0,38    -0,10      0,04    *-0,16   -0,15
IT5 base cos           *-0,21   *-0,40    -0,06     -0,02    -0,14    -0,09
IT5 large eucl         *-0,17   *-0,34   *-0,17     -0,03    *-0,18   -0,08
IT5 large cos          *-0,22   *-0,35   *-0,18      0,03    -0,07     0,00
GroGPT eucl             -0,02   *-0,18    -0,03     -0,13    -0,11    -0,02
GroGPT cos              -0,02    -0,12    -0,09      0,03    -0,09    -0,03
GePpeTto eucl          *-0,20   *-0,16    -0,10     -0,10    -0,07     0,02
GePpeTto cos           *-0,26   *-0,22    -0,12     -0,13    -0,11    -0,01
Minerva eucl           *-0,19   *-0,29    -0,10     *-0,17   *-0,22   -0,05
Minerva cos             0,01     0,03     0,05      -0,04    -0,14     0,11

There is an overall drop in performance, with an increased number of results that are not statistically
significant. Comparing the different approaches, the pattern is the same as at the other levels of analysis: standard deviation performs the worst, as it is almost never significant, and perplexity performs the best, notably as the only methodology for which all three classes are statistically significant. Moreover, with (pseudo)perplexity all models (except for GePpeTto) always obtain statistically significant results, while with mean inter-sentence distance only about half of the results of the unaltered and substitution classes are statistically significant. The highest correlation scores are obtained by Minerva with perplexity (-0.18 unaltered, -0.32 inversion, -0.26 substitution), mBERT with pseudo-perplexity (-0.21, -0.28, and -0.25 respectively), and sBERT with mean inter-sentence cosine distance (-0.28, -0.07, -0.34). Results are in line with what we observed on the overall dataset and when considering sources separately; sentence encoders, in particular, are the only models that manage to have two statistically significant classes with distance-based approaches. There is only a slight difference concerning the impact of language: when directly comparing mBERT with BERT-ita and BERT-ita-xxl, the former performs better both with inter-sentence distance measures and with pseudo-perplexity, and overall multilingual models seem to perform better (except for sBERT among sentence encoders).

Table 7
Spearman correlation between (pseudo)perplexity and human judgment labels, by perturbation type on texts sourced from TEDx. The asterisk indicates coefficients with p-value < 0,05.

(P)PPL ↓          no       swap     sub
mBERT            *-0,26   *-0,38   *-0,28
BERT-ita-xxl     *-0,45   *-0,45   *-0,33
GroGPT           *-0,31   *-0,40   *-0,27
GePpeTto         *-0,26   *-0,23   *-0,22
Minerva          *-0,45   *-0,51   *-0,39

4. Conclusions

We evaluated the coherence assessment abilities of 15 small Italian language models, varying in their structural and training-related characteristics, using two unsupervised approaches: modeling coherence through inter-sentence semantic distance and through perplexity. We evaluated results by their correlation with human judgments of coherence and analysed our dataset at different levels, to monitor differences related to the genre of the target text and the perturbation it was subjected to. Perplexity and pseudo-perplexity consistently obtain the highest correlation with human judgments and appear to be the most effective coherence assessment methods. Among distance measures, the results obtained with sentence encoders were comparable to those of (pseudo)perplexity. Cosine distance appeared to be slightly better than Euclidean distance, while sentence embeddings obtained through the CLS token and the standard deviation of a paragraph's inter-sentence distance proved unsuitable. With perplexity and pseudo-perplexity, the single most impactful decision seemed to be the choice of model, regardless of parameter size or architecture; conversely, architecture was the most influential factor with inter-sentence distance approaches, with sentence encoders obtaining by far the best results. This was shown not only by higher correlation coefficients but also by the very close range of values produced by these models, underlining the reliability of the approach. Model and training set size did not seem to influence performance much, while model language (multilingual or Italian) yielded contradictory results. Textual genre was shown to heavily influence model performance, both quantitatively and qualitatively, with TEDx always obtaining much higher correlation coefficients than Wikipedia. It is unlikely that these results were influenced by the presence of Wikipedia in the training data, given both the lower results and the evaluation against human judgments.
These results could instead be influenced by the wider range of values in the human judgments registered on TEDx, which aids a ranking-based correlation measure; this underlines the relevance of considering genre in performance evaluation. The impact of different sources is also clear when the effect of perturbations is analyzed. Perturbation classes not only exhibited markedly different behavior, but also yielded different results depending on the source of the paragraph. The clearest example is the inversion class, which performed best on TEDx, while on Wikipedia it obtained good results with (pseudo)perplexity but was never statistically significant with distance measures. Inversions affect order, which is more easily picked up by a sequence-based metric like perplexity than by a semantically rooted distance measure. Both were sufficient to detect alterations on TEDx, but only the former was effective on Wikipedia, due to its higher thematic coherence; this highlights the importance of considering perturbations both in isolation and in their interaction with other textual characteristics.

Table 8
Spearman correlation between human judgments and unsupervised methodologies tested on pretrained models. Results on the Wikipedia section of the dataset, grouped by perturbation type. Except for sentence encoders, the sentence pooling strategy is mean-pooling unless stated otherwise. The asterisk indicates p-value < 0,05.
                            MEAN ↓                       STD ↓
                        no       swap     sub        no       swap     sub
LABSE eucl             *-0,22   -0,08    *-0,29      0,11     0,01     0,05
LABSE cos              *-0,21   -0,08    *-0,29      0,08     0,01     0,03
MUSE eucl              *-0,15   -0,09    *-0,24      0,07    -0,05     0,11
MUSE cos               *-0,15   -0,09    *-0,24      0,04    -0,07     0,07
MUSE large eucl        *-0,15   -0,05    *-0,27      0,13    -0,04     0,01
MUSE large cos          -0,14   -0,05    *-0,27      0,12    -0,04    -0,01
sBERT eucl             *-0,28   -0,06    *-0,26      0,10     0,03     0,09
sBERT cos              *-0,28   -0,07    *-0,34      0,09     0,00     0,07
mBERT CLS eucl          -0,10    0,02    *-0,17     -0,11     0,03    -0,05
mBERT CLS cos           -0,12    0,00    *-0,19     -0,13     0,00    -0,07
mBERT eucl              -0,11    0,06    *-0,16      0,01     0,02     0,03
mBERT cos              *-0,16   -0,02    *-0,26     -0,05    -0,04     0,02
BERT-ita CLS eucl       -0,03    0,02     -0,01     -0,09     0,12    -0,07
BERT-ita CLS cos        -0,08    0,01     -0,10     -0,12     0,07    *-0,18
BERT-ita-xxl CLS eucl  *-0,20    0,00     -0,03     -0,12    -0,01    -0,05
BERT-ita-xxl CLS cos   *-0,21    0,00     -0,04     -0,14    -0,03    -0,10
BERT-ita eucl           -0,10    0,00     -0,09      0,01     0,07     0,11
BERT-ita cos            -0,11    0,01    *-0,16      0,03     0,04     0,02
BERT-ita-xxl eucl       -0,14    0,06     -0,13     -0,07     0,07    -0,03
BERT-ita-xxl cos        -0,13    0,02    *-0,20     -0,04     0,04    -0,04
XLM-R base eucl        *-0,16    0,01    *-0,17     -0,04    -0,03    -0,12
XLM-R base cos         *-0,16    0,01    *-0,16     -0,06    -0,03    -0,14
XLM-R large eucl       *-0,15    0,00     -0,10     -0,04     0,00     0,04
XLM-R large cos        *-0,16    0,00     -0,09     -0,06     0,01     0,01
IT5 small eucl          -0,12    0,00     -0,11     -0,06     0,04    -0,13
IT5 small cos           -0,11   -0,01     -0,15     -0,07     0,02    -0,09
IT5 base eucl           -0,14   -0,03     -0,14     -0,09     0,03    *-0,22
IT5 base cos           *-0,15   -0,05    *-0,16     -0,12    -0,03    *-0,23
IT5 large eucl          -0,06    0,03    *-0,16     -0,03     0,06    -0,13
IT5 large cos           -0,12   -0,01    *-0,17     -0,04     0,01    -0,09
GroGPT eucl              0,03   -0,10      0,09      0,04     0,01     0,05
GroGPT cos              -0,02   -0,05     -0,04      0,04    -0,05    -0,05
GePpeTto eucl           -0,05    0,09     -0,13      0,01     0,02     0,03
GePpeTto cos           *-0,16   -0,04    *-0,25     -0,04    -0,02     0,00
Minerva eucl            -0,14    0,03     -0,12     -0,09     0,01    -0,14
Minerva cos              0,04    0,05     -0,13     -0,07     0,09    -0,13

Table 9
Spearman correlation between (pseudo)perplexity and human judgment labels, by perturbation type on texts sourced from Wikipedia. The asterisk indicates p-value < 0,05.
(P)PPL ↓          no       swap     sub
mBERT            *-0,21   *-0,28   *-0,25
BERT-ita-xxl     *-0,17   *-0,28   *-0,19
GroGPT           *-0,19   *-0,26   *-0,29
GePpeTto          -0,02    -0,09    -0,12
Minerva          *-0,18   *-0,32   *-0,26

Acknowledgments
This paper is supported by the PRIN 2022 PNRR Project P20227PEPK (EKEEL - Empowering Knowledge Extraction to Empower Learners), funded by the European Union – Next Generation EU, and the LuCET (LingUistic Complexity Evaluation in educaTion) project under the PRIN grant no. 2022KPNY3B, funded by the Italian Ministry of University and Research.

References
[1] G. Bonetta, C. D. Hromei, L. Siciliani, M. A. Stranisci, Preface to the Eighth Workshop on Natural Language for Artificial Intelligence (NL4AI), in: Proceedings of the Eighth Workshop on Natural Language for Artificial Intelligence (NL4AI 2024) co-located with the 23rd International Conference of the Italian Association for Artificial Intelligence (AI*IA 2024), 2024.
[2] G. L. Beccaria, Dizionario di linguistica e di filologia, metrica, retorica, Einaudi, 2004.
[3] S. Verberne, L. Boves, N. Oostdijk, P.-A. Coppen, Evaluating discourse-based answer extraction for why-question answering, in: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2007, pp. 735–736.
[4] B. Elvevåg, P. W. Foltz, D. R. Weinberger, T. E. Goldberg, Quantifying incoherence in speech: an automated methodology and novel application to schizophrenia, Schizophrenia Research 93 (2007) 304–316.
[5] D. Iter, J. Yoon, D. Jurafsky, Automatic detection of incoherent speech for diagnosing schizophrenia, in: Proceedings of the Fifth Workshop on Computational Linguistics and Clinical Psychology: From Keyboard to Clinic, 2018, pp. 136–146.
[6] P. Muangkammuen, S. Xu, F. Fukumoto, K. R. Saikaew, J. Li, A neural local coherence analysis model for clarity text scoring, in: Proceedings of the 28th International Conference on Computational Linguistics, 2020, pp. 2138–2143.
[7] S. Gerani, Y. Mehdad, G. Carenini, R. Ng, B. Nejat, Abstractive summarization of product reviews using discourse structure, in: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014, pp. 1602–1613.
[8] A. Lai, J. Tetreault, Discourse coherence in the wild: A dataset, evaluation and methods, in: Proceedings of the 19th Annual SIGdial Meeting on Discourse and Dialogue, 2018, pp. 214–223.
[9] F. S. Mim, N. Inoue, P. Reisert, H. Ouchi, K. Inui, Unsupervised learning of discourse-aware text representation for essay scoring, in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop, 2019, pp. 378–385.
[10] A. Shen, M. Mistica, B. Salehi, H. Li, T. Baldwin, J. Qi, Evaluating document coherence modeling, Transactions of the Association for Computational Linguistics 9 (2021) 621–640.
[11] F. Papa, L. Dini, D. Brunato, F. Dell’Orletta, Unraveling text coherence from the human perspective: a novel dataset for Italian, in: Proceedings of the Ninth Italian Conference on Computational Linguistics (CLiC-it 2023), 2023.
[12] A. Beyer, S. Loáiciga, D. Schlangen, Is incoherence surprising? Targeted evaluation of coherence prediction from language models, in: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2021, pp. 4164–4173.
[13] Z. Lin, H. T. Ng, M.-Y. Kan, Automatically evaluating text coherence using discourse relations, in: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, 2011, pp. 997–1006.
[14] L. Pishdad, F. Fancellu, R. Zhang, A. Fazly, How coherent are neural models of coherence?, in: Proceedings of the 28th International Conference on Computational Linguistics, 2020, pp. 6126–6138.
[15] H. C. Moon, M. T. Mohiuddin, S. Joty, C. Xu, A unified neural coherence model, in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019, pp. 2262–2272.
[16] M. T. Mohiuddin, P. Jwalapuram, X. Lin, S. Joty, Rethinking coherence modeling: Synthetic vs. downstream tasks, in: Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, 2021, pp. 3528–3539.
[17] R. Barzilay, M. Lapata, Modeling local coherence: An entity-based approach, Computational Linguistics 34 (2008) 1–34.
[18] P. Laban, L. Dai, L. Bandarkar, M. A. Hearst, Can transformer models measure coherence in text: Re-thinking the shuffle test, in: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), 2021, pp. 1058–1064.
[19] D. Iter, K. Guu, L. Lansing, D. Jurafsky, Pretraining with contrastive sentence objectives improves discourse performance of language models, in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020, pp. 4859–4870.
[20] A. Maimon, R. Tsarfaty, A novel computational and modeling foundation for automatic coherence assessment, arXiv preprint arXiv:2310.00598 (2023).
[21] P. Huber, G. Carenini, Towards understanding large-scale discourse structures in pre-trained and fine-tuned language models, in: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2022, pp. 2376–2394.
[22] S. Duari, V. Bhatnagar, FFCD: A fast-and-frugal coherence detection method, IEEE Access 10 (2021) 85305–85314.
[23] F. Koto, J. H. Lau, T. Baldwin, Discourse probing of pretrained language models, in: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2021, pp. 3849–3864.
[24] D. Brunato, D. Colla, F. Dell’Orletta, I. Dini, D. P. Radicioni, A. A. Ravelli, DisCoTex at EVALITA 2023: Overview of the assessing discourse coherence in Italian texts task, in: Proceedings of the Eighth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2023), CEUR.org, Parma, Italy, 2023.
[25] M. Lai, S. Menini, M. Polignano, V. Russo, R. Sprugnoli, G. Venturi, EVALITA 2023: Overview of the 8th evaluation campaign of natural language processing and speech tools for Italian, in: Proceedings of the Eighth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2023), CEUR.org, Parma, Italy, 2023.
[26] F. Feng, Y. Yang, D. Cer, N. Arivazhagan, W. Wang, Language-agnostic BERT sentence embedding, in: S. Muresan, P. Nakov, A. Villavicencio (Eds.), Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Dublin, Ireland, 2022, pp. 878–891. URL: https://aclanthology.org/2022.acl-long.62. doi:10.18653/v1/2022.acl-long.62.
[27] Y. Yang, D. Cer, A. Ahmad, M. Guo, J. Law, N. Constant, G. Hernandez Abrego, S. Yuan, C. Tar, Y.-h. Sung, B. Strope, R. Kurzweil, Multilingual universal sentence encoder for semantic retrieval, in: A. Celikyilmaz, T.-H. Wen (Eds.), Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, Association for Computational Linguistics, Online, 2020, pp. 87–94. doi:10.18653/v1/2020.acl-demos.12.
[28] A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, E. Grave, M. Ott, L. Zettlemoyer, V. Stoyanov, Unsupervised cross-lingual representation learning at scale, in: D. Jurafsky, J. Chai, N. Schluter, J. Tetreault (Eds.), Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Online, 2020, pp. 8440–8451. doi:10.18653/v1/2020.acl-main.747.
[29] G. Sarti, M. Nissim, IT5: Large-scale text-to-text pretraining for Italian language understanding and generation, arXiv preprint arXiv:2203.03759 (2022).
[30] W. de Vries, M. Nissim, As good as new. How to successfully recycle English GPT-2 to make models for other languages, in: C. Zong, F. Xia, W. Li, R. Navigli (Eds.), Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, Association for Computational Linguistics, Online, 2021, pp. 836–846. doi:10.18653/v1/2021.findings-acl.74.
[31] L. De Mattei, M. Cafagna, F. Dell’Orletta, M. Nissim, M. Guerini, GePpeTto carves Italian into a language model, arXiv preprint arXiv:2004.14253 (2020). doi:10.48550/arXiv.2004.14253.
[32] N. Reimers, I. Gurevych, Sentence-BERT: Sentence embeddings using Siamese BERT-networks, in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019, pp. 3982–3992.