<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Coherence Evaluation in Italian Language Models</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Marta Sartor</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Felice Dell'Orletta</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giulia Venturi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Istituto di Linguistica Computazionale “A. Zampolli”, (ILC-CNR) ItaliaNLP Lab</institution>
          ,
          <addr-line>via G. Moruzzi 1, Pisa</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Coherence assessment is central to many NLP tasks, but its evaluation is complex and often done indirectly. In the LLM era, it is even more crucial to understand how, and how well, these models represent coherence. This study investigates the effectiveness of small Italian language models (under 1B parameters) in assessing coherence and focuses on which factors most influence their performance. Our analysis involves 15 Transformer-based LLMs differing in architecture, parameter size, and training data, and monitors different textual genres and perturbations used during dataset construction. Two coherence modeling strategies are tested: perplexity and inter-sentence semantic distance. We show that best practices vary significantly depending on model architecture and approach, but most importantly on the kind of texts they are applied to, highlighting the nuanced interaction between textual genre, data perturbation, and model performance.</p>
      </abstract>
      <kwd-group>
        <kwd>Italian LM</kwd>
        <kwd>coherence assessment</kwd>
        <kwd>perplexity</kwd>
        <kwd>inter-sentence semantic distance</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Coherence is the meaning connection that binds the components of a text [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] and is fundamental to
ensuring the effectiveness of every communicative act. Consequently, in computational linguistics, its
analysis is crucial for the resolution of numerous tasks, from identifying the necessary information for
question answering [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] to recognizing pathological speech [
        <xref ref-type="bibr" rid="ref4 ref5">4, 5</xref>
        ], from automatic readability assessment
[
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] to automatic summary generation [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. Its critical importance has led to the development of a number
of resources (e.g. [
        <xref ref-type="bibr" rid="ref10 ref8 ref9">8, 9, 10</xref>
        ]); for Italian specifically, a new dataset annotated with human judgments of
coherence has recently been released (DisCoTex, [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]).
      </p>
      <p>
        It is however notably complex to model coherence computationally, as it does not require explicit
linguistic structures to be expressed: it is rather a psychological construct [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], reconstructed implicitly
through inferences, general knowledge, co-text, and context [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Moreover, its highly subjective nature
[
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] makes coherence also difficult to assess and evaluate: the soundest and most direct approach
would be employing human evaluations, but such data is very costly and slow to collect. For this
reason, the most common coherence evaluation strategies are by proxy, primarily through the order
discrimination task. Its underlying assumption that shuffled texts are less coherent than the original,
though sound [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], has shown its limits [
        <xref ref-type="bibr" rid="ref12 ref14 ref15 ref16">14, 12, 15, 16</xref>
        ] from the outset. However, its efficiency in terms
of resources has encouraged several variations on the original task [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ], ranging from altering the
number and position of the shuffled sentences [
        <xref ref-type="bibr" rid="ref15 ref18">15, 18</xref>
        ] to replacing shufling with substitutions from a
closely related document [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
      </p>
      <p>
        Since the introduction of Transformer models and the paradigm shift brought about by Large Language
Models, coherence modeling approaches have changed and shifted in that direction. Many works employ
LMs, developing specific models through specialized training [
        <xref ref-type="bibr" rid="ref19 ref20">19, 20</xref>
        ] or leveraging pretrained models
through new indirect approaches [
        <xref ref-type="bibr" rid="ref21 ref22">21, 22</xref>
        ]. A great deal of attention has since also been devoted to
probing these models to more accurately evaluate their ability on coherence assessment [
        <xref ref-type="bibr" rid="ref10 ref18 ref23">10, 23, 18</xref>
        ].
Our contribution. We test the ability of small (under 1B parameters) Italian LLMs to model coherence,
evaluating them against human judgments on two modeling strategies: perplexity and inter-sentence
semantic distance. We were interested in examining how heavily different factors impact model
performance, and we focused our analysis on three main aspects:
• LLM characteristics;
• textual genre of the target text;
• textual perturbation applied during dataset construction.
      </p>
      <p>
        The first point has a widely known impact and we attempted, insofar as the available models allowed, to
systematically monitor several components. To this end we selected 15 Transformer-based LLMs, all
under 1 billion parameters, differing in architecture, parameter size, target language, and/or training
data size. The literature is also quite clear on the impact that both textual genre and data perturbation
can have, both in the training and in the evaluation phase. Nonetheless, it is not always easy to monitor
these factors, especially due to resource availability. In order to take a deeper look at both these aspects,
we chose to work on the DisCoTex [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] dataset, which contains small paragraphs from two different
genres (TEDx and Wikipedia) and with different degrees of perturbation at the inter-sentential level
(none, inversion, substitution), where each instance is annotated with human judgments of coherence.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Methodology</title>
      <p>We tested two different unsupervised approaches to modeling coherence: inter-sentence semantic distance
and perplexity.</p>
      <p>The first approach is a widely used technique to compare vectorial representations of meaning and
was calculated on all models tested, regardless of architecture. The paragraph is first divided into
sentences using the sentence-splitting feature of the Stanza tokenizer, version 1.5.0. Each sentence is
tokenized and processed by the model, from which a single vector representation of the sentence is
obtained: inter-sentence distance is then calculated between all pairs of consecutive sentences, and a
global paragraph value is obtained through a statistical function. Sentence embeddings were calculated
differently on the basis of the model’s architecture: for sentence encoders, the direct output of the model
was taken; for decoders, the last-layer representations of each token in the sentence were mean-pooled
into a single vector; for encoder-decoders and encoders, the same process was applied to the encoder’s
last layer, and the CLS token was also tested as a possible sentence embedding for encoder models.
Since this is a less straightforward methodology, at each step we tested several variants to broaden the analysis
as much as possible:
• the measure of distance was calculated both with cosine and with Euclidean distance;
• for encoders, sentence embeddings were represented both through the CLS token and through
mean-pooling of the sentence tokens;
• paragraph values were pooled from inter-sentence values by different statistical functions, namely
mean and standard deviation.</p>
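      <p>To make the procedure concrete, the following is a minimal sketch of the inter-sentence distance computation, using NumPy and toy vectors in place of real model embeddings (the function names are ours, for illustration only):</p>

```python
import numpy as np

def cosine_distance(a, b):
    # 1 - cosine similarity between two embedding vectors
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def euclidean_distance(a, b):
    return float(np.linalg.norm(a - b))

def paragraph_scores(sentence_embeddings, distance=cosine_distance):
    # Distances between all pairs of consecutive sentences,
    # pooled into global paragraph values by mean and standard deviation
    dists = [distance(a, b)
             for a, b in zip(sentence_embeddings, sentence_embeddings[1:])]
    return {"mean": float(np.mean(dists)), "std": float(np.std(dists))}

# Toy example: three 4-dimensional "sentence embeddings"
emb = [np.array([1.0, 0.0, 0.0, 0.0]),
       np.array([1.0, 0.1, 0.0, 0.0]),
       np.array([0.0, 1.0, 0.0, 0.0])]
scores = paragraph_scores(emb)   # cosine distance, mean/std pooling
```

      <p>In the real pipeline, the embeddings would come from one of the models described below (e.g. mean-pooled last-layer token representations).</p>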
      <p>Our choice of statistical function fell first on the mean, a global measure of semantic distance
widely used and recognized in the literature. We chose to add the standard deviation, a less commonly used
function, due to the nature of the dataset, where the data is locally perturbed: we felt that a
local measure of distance could be suited to modeling such an alteration.</p>
      <p>
        We also tried modeling coherence in terms of textual plausibility by using perplexity: it is a global
indicator and has already been successfully employed to this end [
        <xref ref-type="bibr" rid="ref12 ref18">12, 18</xref>
        ]. This method was applied to
all decoders and also to two selected encoder models, which allowed a direct comparison with respect
to the target language strategy. On decoder models we calculated perplexity, while with encoder models
we used a plausibility metric so as to be able to compare it with perplexity. Plausibility was calculated
through masked language modeling, by masking each token of the paragraph one by one and averaging
their likelihood across the paragraph, as reported below, with N being the total number of tokens:
plausibility = (1/N) ∑_{i=1}^{N} p(t_i)
      </p>
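      <p>A minimal sketch of the two metrics, assuming the per-token probabilities have already been obtained from a model (for plausibility, the probability of each token when masked; for perplexity, the probability of each token given its left context); the values below are hypothetical:</p>

```python
import math

def plausibility(token_probs):
    # Average masked-token likelihood: (1/N) * sum of p(t_i)
    return sum(token_probs) / len(token_probs)

def perplexity(token_probs):
    # exp of the average negative log-likelihood of the tokens
    n = len(token_probs)
    return math.exp(-sum(math.log(p) for p in token_probs) / n)

# Hypothetical per-token probabilities for a short paragraph
probs = [0.9, 0.8, 0.7, 0.85]
pl = plausibility(probs)   # higher = more plausible/coherent
pp = perplexity(probs)     # lower = more plausible/coherent
```

      <p>Note the opposite directions of the two scores, which is why the correlation sign of plausibility is inverted later in the analysis.</p>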
      <sec id="sec-2-1">
        <p>1. https://pypi.org/project/stanza/1.5.0/#files</p>
        <p>In order to address the impact of textual genre and text perturbation, the analysis of the results
was carried out at various levels of granularity: on the entire dataset, by source, and by perturbation
for each source. Each model was evaluated based on the Spearman correlation of its results with
human judgment. Additionally, the difference in distribution between classes (source of texts or type
of perturbation) was assessed using the Wilcoxon T-test and the rank-biserial correlation. Evaluating
performance through correlation with human judgment, besides being a more straightforward and
reliable approach, also allows us to effectively counterbalance the possible bias introduced by the fact
that Wikipedia is present both in the pretraining data of most models and in our evaluation dataset.</p>
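        <p>The evaluation step can be sketched as follows; a self-contained illustration of the Spearman correlation (rank the two series, then take the Pearson correlation of the ranks), which in practice would typically be computed with a library such as scipy.stats:</p>

```python
import numpy as np

def rankdata(x):
    # Average ranks (1-based), with ties sharing their mean rank
    x = np.asarray(x, dtype=float)
    order = np.argsort(x)
    sorted_x = x[order]
    ranks = np.empty(len(x))
    i = 0
    while i < len(x):
        j = i
        while j + 1 < len(x) and sorted_x[j + 1] == sorted_x[i]:
            j += 1
        ranks[order[i:j + 1]] = (i + j) / 2.0 + 1.0
        i = j + 1
    return ranks

def spearman(model_scores, human_judgments):
    # Spearman rho = Pearson correlation of the ranks
    r1, r2 = rankdata(model_scores), rankdata(human_judgments)
    return float(np.corrcoef(r1, r2)[0, 1])

# Lower distance should track higher coherence -> negative correlation expected
dist = [0.9, 0.7, 0.4, 0.2, 0.1]
judg = [1.0, 2.0, 3.5, 4.0, 5.0]
rho = spearman(dist, judg)   # perfectly monotone decreasing
```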
        <p>Baselines were set as random values. Perplexity and plausibility were both assimilated to a probability
distribution, and thus we generated random values between 0 and 1. For inter-sentence distance, the
chosen measure of distance was calculated over as many random values as the average length
in sentences of the dataset items, which is 4; the range in which we generated each value changed
based on the measure: -1 to 1 for cosine, and 0 to 1 for Euclidean distance (as if the values had been
normalized, since Euclidean distance has no finite maximum).</p>
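        <p>Under our reading of this setup, the baselines can be sketched as below; the constants and the mean pooling of the random draws are illustrative assumptions:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
N_ITEMS = 1064   # number of dataset instances
AVG_SENTS = 4    # average item length in sentences

# Perplexity/plausibility baseline: random values in [0, 1]
ppl_baseline = rng.uniform(0.0, 1.0, size=N_ITEMS)

def distance_baseline(low, high):
    # As many random values per item as the average sentence count,
    # pooled by mean into a paragraph-level score
    draws = rng.uniform(low, high, size=(N_ITEMS, AVG_SENTS))
    return draws.mean(axis=1)

cosine_baseline = distance_baseline(-1.0, 1.0)    # cosine range
euclidean_baseline = distance_baseline(0.0, 1.0)  # Euclidean, as if normalized
```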
        <sec id="sec-2-1-1">
          <title>2.1. Dataset</title>
          <p>
            In this work, we used the dataset released by [
            <xref ref-type="bibr" rid="ref11">11</xref>
            ], recently integrated into a larger benchmark released
for DisCoTex [
            <xref ref-type="bibr" rid="ref24">24</xref>
            ], a shared task on textual coherence analysis presented at the 8th evaluation
campaign of NLP and speech tools for the Italian language (EVALITA 2023) [
            <xref ref-type="bibr" rid="ref25">25</xref>
            ]. This dataset consists
of 1064 instances, each corresponding to a paragraph of 4-5 sentences and annotated with human
coherence judgments. The data is sourced either from the Italian Wikipedia or from the Italian section of the
Multilingual TEDx dataset (which contains TEDx transcripts), to represent different linguistic varieties;
the instances are balanced by source.
          </p>
          <p>During dataset construction, about two thirds of the instances were subjected to alterations that more
or less significantly damaged the internal coherence of the paragraph, in order to test the effect on human
judgment of some common text perturbation strategies. The alterations were either the inversion of
any two sentences within the paragraph (inversion perturbation), or the replacement of a sentence in
the paragraph with the tenth sentence from the end of the source document (substitution perturbation). The
remaining third of the instances was instead left unaltered, to serve as a control group.</p>
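          <p>The two perturbation strategies described above amount to simple list operations on the paragraph’s sentences; a small illustrative sketch (the function names are ours):</p>

```python
def invert(sentences, i, j):
    # Inversion perturbation: swap any two sentences within the paragraph
    s = list(sentences)
    s[i], s[j] = s[j], s[i]
    return s

def substitute(sentences, idx, outside_sentence):
    # Substitution perturbation: replace one sentence with a sentence
    # drawn from outside the paragraph
    s = list(sentences)
    s[idx] = outside_sentence
    return s

paragraph = ["S1.", "S2.", "S3.", "S4."]
inverted = invert(paragraph, 0, 2)            # swaps S1 and S3
substituted = substitute(paragraph, 1, "X.")  # replaces S2
```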
          <p>Each paragraph is annotated with human judgment values, corresponding to the mean and standard
deviation of the ratings collected on each instance from at least 10 human annotators. The judgments
were collected through crowdsourcing from native Italian speakers and are expressed on a Likert scale
from 1 to 5, 1 being the lowest coherence score and 5 the highest.</p>
          <p>It must be noted, however, that the source or perturbation type differentiates texts significantly on the
basis of the distribution of human coherence judgments they receive (see figures 1 and 2), to the point
that the difference between distributions remains statistically significant even when differentiating
instances by both source and perturbation type (see figure 3). Indeed, the difference increases the
heavier the perturbation applied to the instance, but the source has a much stronger impact than
the perturbation. It is also worth noting that the value range in which most Wikipedia texts are located is
far less sparse than that occupied by texts sourced from TEDx.</p>
          <p>For this work, the dataset was integrated with additional data that was not available in the released
version, namely all source and perturbation labels.</p>
        </sec>
        <sec id="sec-2-1-2">
          <title>2.2. Models</title>
          <p>We tested 15 different Transformer-based models, covering the most common architectures (encoder,
decoder, encoder-decoder, sentence encoder).</p>
          <p>
            Most models are BERT-based to allow better comparability on some of the variations we wanted
to account for: we used a multilingual version (mBERT) and two monolingual Italian versions
(BERT-ita and BERT-ita-xxl) differentiated by dataset size, as well as a sentence encoder (sBERT). Different
architecture sizes were not tested because they were not available for the Italian versions of these models;
uncased versions were not tested due to the nature of the Italian language and, most importantly, the
developers’ own recommendations, which indicated that the cased version performs better.
          </p>
          <table-wrap id="tab1">
            <label>Table 1</label>
            <caption>
              <p>Characteristics of the tested models: architecture, parameter size, training data size, and target language (n.d. = not declared).</p>
            </caption>
            <table>
              <thead>
                <tr><th>Model</th><th>Architecture</th><th>Parameters</th><th>Training data</th><th>Language</th></tr>
              </thead>
              <tbody>
                <tr><td>LaBSE [<xref ref-type="bibr" rid="ref26">26</xref>]</td><td>sentence encoder</td><td>470M</td><td>n.d.</td><td>multilingual</td></tr>
                <tr><td>MUSE [<xref ref-type="bibr" rid="ref27">27</xref>]</td><td>sentence encoder</td><td>69M</td><td>n.d.</td><td>multilingual</td></tr>
                <tr><td>MUSE large</td><td>sentence encoder</td><td>85M</td><td>n.d.</td><td>multilingual</td></tr>
                <tr><td>sBERT</td><td>sentence encoder</td><td>111M</td><td>n.d.</td><td>italian</td></tr>
                <tr><td>mBERT</td><td>encoder</td><td>179M</td><td>n.d.</td><td>multilingual</td></tr>
                <tr><td>BERT-ita</td><td>encoder</td><td>111M</td><td>13GB</td><td>italian</td></tr>
                <tr><td>BERT-ita-xxl</td><td>encoder</td><td>111M</td><td>81GB</td><td>italian</td></tr>
                <tr><td>XLM-R base [<xref ref-type="bibr" rid="ref28">28</xref>]</td><td>encoder</td><td>250M</td><td>2.5T</td><td>multilingual</td></tr>
                <tr><td>XLM-R large [<xref ref-type="bibr" rid="ref28">28</xref>]</td><td>encoder</td><td>560M</td><td>2.5T</td><td>multilingual</td></tr>
                <tr><td>IT5 small [<xref ref-type="bibr" rid="ref29">29</xref>]</td><td>encoder-decoder</td><td>60.5M</td><td>215GB</td><td>italian</td></tr>
                <tr><td>IT5 base [<xref ref-type="bibr" rid="ref29">29</xref>]</td><td>encoder-decoder</td><td>223M</td><td>215GB</td><td>italian</td></tr>
                <tr><td>IT5 large [<xref ref-type="bibr" rid="ref29">29</xref>]</td><td>encoder-decoder</td><td>770M</td><td>215GB</td><td>italian</td></tr>
                <tr><td>GroGPT [<xref ref-type="bibr" rid="ref30">30</xref>]</td><td>decoder</td><td>117M</td><td>13.8GB</td><td>italian</td></tr>
                <tr><td>GePpeTto [31]</td><td>decoder</td><td>117M</td><td>13.8GB</td><td>italian</td></tr>
                <tr><td>Minerva</td><td>decoder</td><td>350M</td><td>n.d.</td><td>italian</td></tr>
              </tbody>
            </table>
          </table-wrap>
          <p>Table 1 summarizes the models’ characteristics with respect to our research questions. With
"language" here we refer to the target language of the models, which does not always coincide with the
language of training data: for example, GroGPT is developed for Italian but is an English GPT-2 model
whose lexical embeddings have been retrained for Italian.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Experimental Results</title>
      <p>As previously stated, the coherence judgments expressed by people are on a Likert scale from 1 (not very
coherent) to 5 (very coherent). Cosine and Euclidean distances, as well as perplexity, conversely express
greater coherence when the score is lower, so a negative correlation is expected. Plausibility, on the
other hand, being a probability distribution, has low scores for incoherent texts and high scores for
coherent texts, like the Likert scale; thus, the direction of its correlation is opposite to that of all other
measures. In order to compare plausibility and perplexity, the sign of the correlation with plausibility
has been inverted; we will henceforth refer to it as pseudo-perplexity.</p>
      <sec id="sec-3-1">
        <p>2. MUSE large: https://tfhub.dev/google/universal-sentence-encoder-multilingual-large/3
3. sBERT: https://huggingface.co/nickprock/sentence-bert-base-italian-xxl-uncased
4. mBERT: https://github.com/google-research/bert/blob/master/multilingual.md
5. BERT-ita: https://huggingface.co/dbmdz/bert-base-italian-cased
6. BERT-ita-xxl: https://huggingface.co/dbmdz/bert-base-italian-xxl-cased
7. Minerva: https://huggingface.co/sapienzanlp/Minerva-350M-base-v1.0</p>
        <p>The analysis of results was performed on three different levels: on the entire dataset (sect. 3.1) and,
to account for genre and perturbation differences, separating instances by their source (sect. 3.2) and,
for each source, by their perturbation type (sect. 3.3).</p>
        <sec id="sec-3-1-1">
          <title>3.1. Overall Analysis</title>
          <p>The Spearman correlation of the models’ predictions with human judgments of coherence is shown in
tables 2 and 3, which cover approaches using inter-sentence distance and (pseudo)perplexity respectively.
Coefficients marked with an asterisk are statistically significant.</p>
          <p>Results for each tested methodology and model are always above the baseline, except for the
inter-sentence standard deviation of some models. Globally, the strongest correlation with human judgment
is obtained by perplexity with Minerva (-0.46) and mBERT (-0.45), and by the average inter-sentence
cosine distance calculated with sBERT (-0.45). Among the best are also the average cosine (-0.43) and
Euclidean (-0.43) distances with LaBSE, and the average Euclidean distance with sBERT (-0.42).</p>
          <p>Comparing the different functions of inter-sentence distance, standard deviation appears to be an
unreliable approach: results often lacked correlation with human judgment or had very low coefficients
compared to mean inter-sentence distance, which is also (almost) always statistically significant. Across
all models, the standard deviation averages -0.08 and -0.11 correlation for Euclidean and cosine distance
respectively, while the mean inter-sentence distance averages, respectively, -0.30 and -0.31. These
results also highlight a different trend, which is that Euclidean distance generally obtains lower scores
than cosine distance. This difference is even more pronounced (-0.29 vs -0.32, -0.07 vs -0.11) when
leaving out the one notable exception, Minerva, which despite great results with Euclidean distance
has null to non-significant correlation with cosine distance. The best approach however remains using
perplexity or pseudo-perplexity (-0.38), whose average is much higher than what the same models
averaged when using mean inter-sentence distance (Euclidean: -0.27, cosine: -0.25). As for
sentence encodings, embeddings obtained with CLS are significantly worse, not only with respect to
the corresponding mean-pooled embeddings (as already suggested by the literature: see e.g. [32]) but
also with respect to every other model.</p>
          <p>Among the different architectures, sentence encoders obtain overall the best results: they reach
the highest correlation scores, and there is great consistency among the different models: their average
mean inter-sentence distance coefficient is -0.40 for Euclidean distance and -0.41 for cosine distance,
both higher than the average perplexity score. Their predictions have consistently lower correlation
than other models (or no correlation at all) only when observing the standard deviation of inter-sentence
distance. On the other hand IT5, which represents encoder-decoders, has a good correlation with human
judgment both by mean and by standard deviation of inter-sentence distance, the latter being statistically
significant and the highest among all models, though rather low. The different IT5 versions have, with mean
inter-sentence distance, lower correlation than sentence encoders, but still perform on par with the best
encoder models. Overall, the correlation with human judgment of encoder results appears very variable.
However, the high variability is due to the much lower performance obtained when using CLS, instead
of mean-pooling, as a sentence representation method: this strategy results in little to no significant
correlation with human judgments. Considering only encoders with the mean-pooling strategy, the
average correlation with human judgment is fairly high, although inferior to that of sentence encoders
and IT5. Decoders show the highest variability across results depending on the specific model: Minerva
obtains the best results, but only using Euclidean distance, while methods employing cosine distance
lead to non-significant results; GePpeTto has the lowest perplexity-based scores but performs well with mean
inter-sentence distance; lastly, GroGPT has a low but consistent performance with both mean and
standard deviation of inter-sentence distance, but has good perplexity scores.</p>
          <p>As for parameter size, it does not seem to influence results, either when comparing different
sizes of the same model (where the comparison is possible) or when considering the absolute
parameter size: only with sentence encoders does the correlation increase in parallel with parameter size,
and even there the difference is rather small. Training dataset size, on the other hand, consistently improves the
performance from BERT-ita to BERT-ita-xxl, especially when using CLS.</p>
          <p>It is instead unclear how, or whether, target language influences performance: multilingual encoders perform
in between BERT-ita and BERT-ita-xxl when considering inter-sentence distances, hinting
at a possible advantage, and with (pseudo)perplexity multilingual BERT has a correlation score which
is almost on par with that of the much bigger Minerva. For sentence encoders, however, the opposite
seems to be true: sBERT performs comparably to the best multilingual sentence encoder, LaBSE, despite
the considerable difference in parameter size.</p>
        </sec>
        <sec id="sec-3-1-2">
          <title>3.2. Analysis by source</title>
          <p>
            Tables 4 and 5 show the correlation coefficients of model predictions with human judgments when
dividing the dataset by source. Confirming the results of [
            <xref ref-type="bibr" rid="ref24">24</xref>
            ], performance on the TEDx and the
Wikipedia sections is very different, with the former obtaining higher coefficients with all coherence
assessment approaches (with the only exception of the mean inter-sentence cosine distance calculated
with IT5 base). On Wikipedia, the correlation with human judgments is more likely to be non-significant,
and its coefficient is stronger than -0.2 only with sentence encoders or through (pseudo)perplexity
scores: excluding the standard deviation of inter-sentence distance, the average Wikipedia coefficient is
-0.12, while for TEDx it is -0.24. This could be influenced by the fact that human coherence judgments
on Wikipedia texts are more densely distributed in the upper (high-coherence) range, as exemplified in
figure 1; it is also worth noting that higher coherence values for Wikipedia texts were also produced
by almost all models regardless of approach, although the magnitude of this difference varied considerably
between models.
          </p>
          <p>In line with what was observed on the entire dataset, sentence encoders remain the best performing
architecture and (pseudo)perplexity the most effective coherence assessment approach. The unsuitability
of the standard deviation of inter-sentence distance and of using CLS as a sentence embedding is also further
confirmed, with cosine distance still obtaining better results than Euclidean distance. Minerva also
maintains the skewness in results between approaches using Euclidean and cosine distances.</p>
          <p>The combination that achieves the highest correlation with human judgment is perplexity
calculated with Minerva (-0.46 and -0.26 on TEDx and Wikipedia respectively), followed closely by
pseudo-perplexity calculated with BERT-ita-xxl (-0.43, -0.23); the average inter-sentence cosine distance
calculated with sBERT (-0.38, -0.24) also obtains satisfactory results.</p>
          <p>As we already observed, the standard deviation of inter-sentence distance is not a reliable coherence
indicator: correlation with human judgment is hardly ever significant, and when it is, it is only significant
on one of the two classes and with very low coefficients. Perplexity and pseudo-perplexity, on the other
hand, with the sole exception of GePpeTto, obtain much higher correlation on TEDx than any other
approach and keep consistently high coefficients on Wikipedia, where most other approaches falter.
Besides GePpeTto, perplexity results vary significantly between models on TEDx (from -0.32 to -0.46)
but are mostly identical (from -0.23 to -0.26) on Wikipedia; moreover, the best perplexity results gain a
noticeable margin over those of inter-sentence distances (0.08) on TEDx, while those on Wikipedia
improve by only 0.02. This reduced improvement brought about by perplexity, together with the overall
lower performance on the Wikipedia section, supports our claim that its presence in most training
datasets is offset by using human coherence judgments, and not perturbation labels, for the evaluation.</p>
          <p>Sentence encoders remain the best performing architecture, not only for their higher correlation
coefficients on the TEDx section but also, and especially, for their performance on the Wikipedia section,
which is always significant and higher on average than that of any other architecture. As was the case on
the overall dataset, their average score (-0.35 for TEDx and -0.19 for Wikipedia) is close to the average
perplexity (-0.36 and -0.21 respectively), although this time slightly lower. Sentence encoders also
appear in general much less sensitive than other architectures to the type of distance used, except for
sBERT. Encoders maintain a certain variability depending on the model and are still comparable to
decoders when using perplexity, but this time with inter-sentence distances they generally perform
better than IT5, for which the Wikipedia class always has low to no correlation with human judgment.
Decoders, on the other hand, have consistently low performance on inter-sentence distance tasks, in
contrast with what was previously observed; the sole exception is GePpeTto when leveraging the mean
cosine distance: once more its leading role when used as an encoder (for sentence embeddings) reverses
when used as a decoder (for perplexity), where it obtains the lowest scores.</p>
          <p>As regards parameter and training data size, no significant differences were observed. The
impact of the target language remains, however, unclear: overall there seems to be no clear preference
for either, and the direct comparison between mBERT and BERT-ita/BERT-ita-xxl shows the former
outperforming the latter on inter-sentence distance tasks, and the opposite on pseudo-perplexity.</p>
        </sec>
        <sec id="sec-3-1-3">
          <title>3.3. Analysis by source and perturbation</title>
          <p>As we already observed, the huge differences between TEDx and Wikipedia (both in terms of human
judgment and of model behavior) are such that the impact of different kinds of perturbation can only be
observed by keeping the two genres separate. The impact of the aforementioned difference can also be very
clearly seen at this level of analysis: correlation results exhibit strong differences between the TEDx
and Wikipedia sections, both in terms of significance and of class distribution. Not only do the TEDx
section results show higher correlation coefficients, but in the Wikipedia section most correlations are
not significant. Furthermore, the perturbation class with the highest correlation with human judgment
for TEDx, namely the inversion class, is never significant in the Wikipedia section except when using
(pseudo)perplexity. It is worth noting not only that the way the perturbation classes rank in terms of performance
differs between TEDx and Wikipedia, and for Wikipedia between inter-sentence distance-based
methods and perplexity, but also that these differences are not aligned with inter-annotator
agreement. The only common factors between the two sections are the effectiveness of pseudo-perplexity
and sentence encoders and the ineffectiveness of the standard deviation of inter-sentence distance. Due to
these significant differences, the two sections are treated separately.
3.3.1. TEDx
The correlation of models’ predictions with human judgment in the TEDx section is shown in tables 6
and 7, for inter-sentence distance approaches and (pseudo)perplexity respectively. Among the different
perturbation classes, the inversion class generally has the strongest correlation with human judgment,
followed by the class without alterations and then by the substitution class. Correlation is
generally statistically significant, except for the standard-deviation inter-sentence distance measures,
which as we already observed are an unreliable approach. The only other cases of non-significant
coefficients occur with mean inter-sentence distance, mainly in the substitution class and only in a few
cases in the unaltered class, mostly when using CLS as the sentence embedding.</p>
          <p>The highest correlation with human judgment was obtained by Minerva’s perplexity (-0.45 unaltered,
-0.51 inversion, -0.39 substitution), followed by BERT-ita-xxl’s pseudo-perplexity (-0.45, -0.45, -0.33) and
LaBSE’s mean inter-sentence cosine distance (-0.35, -0.42, -0.25).</p>
          <p>As always, perplexity was the approach with highest correlation to human judgment, although with
considerable internal variability on the basis of the model: it averaged -0.35 for the unaltered class,
-0.39 for the inversion class, and -0.30 for the substitution class. Also in line with previous observations,
sentence encoders remain the best performing architecture, obtaining consistently high results and
averaging -0.34, -0.36, and -0.23 for the unaltered, inversion, and substitution classes respectively, not
too far from the perplexity scores. The role of the model language (multilingual or Italian) remains
however unclear, following the same patterns observed in the dataset divided by source.</p>
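<p>For reference, the (pseudo)perplexity scores can be sketched as follows. The scoring callables are stand-ins for a real model: with a causal LM the per-token log-probabilities come from one forward pass, while pseudo-perplexity masks each position in turn and scores the true token there (one forward pass per token with a masked LM such as BERT).</p>

```python
import math

def perplexity(logprobs):
    """Perplexity from the per-token log-probabilities of a causal LM."""
    return math.exp(-sum(logprobs) / len(logprobs))

def pseudo_perplexity(tokens, masked_logprob):
    """Pseudo-perplexity for a masked LM.

    masked_logprob(tokens, i) must return the model's log-probability of
    the true token at position i when that position is masked; it is a
    stand-in here for a call to a real masked language model.
    """
    total = sum(masked_logprob(tokens, i) for i in range(len(tokens)))
    return math.exp(-total / len(tokens))
```

<p>In both cases a lower value means the paragraph is less surprising to the model, which is why the correlations with human coherence ratings are negative.</p>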
          <p>Generally speaking, this level of analysis is mostly coherent with the previous ones. Some diferences
concern the role of parameter and training size. While parameter size still does not seem relevant in
absolute terms, this time it has a positive impact when considering diferent sizes of MUSE and XLM-R
(although it seems almost counterproductive on IT5). Similarly, training data size improves performance
from BERT-ita to BERT-ita-xxl (especially with mean-pooling), but does not seem to influence other
models. The diference between Euclidean and cosine distances is also reduced.
3.3.2. Wikipedia
Tables 8 and 9 show the correlation between the models’ predictions and the human coherence judgments.
The most interesting results concern the perturbation classes: the performance ranking of the different
classes is not only different from that of TEDx, but also differs between approaches using
inter-sentence distance and those using (pseudo)perplexity. In the first case, the substitution class has the highest
correlation with human judgment, while the inversion class performs the worst, never being statistically
significant; when using (pseudo)perplexity, on the other hand, the inversion class is the one that
correlates the most with human judgment, followed by the substitution class. Upon further inspection,
on the substitution and unaltered classes there is not much difference between the average performance
using (pseudo)perplexity (-0.15 and -0.22 respectively) or mean inter-sentence distance (-0.14 and -0.17,
excluding outliers like CLS embeddings and cosine Minerva), especially when considering cosine distance
(-0.15 and -0.20). What changes, radically, is performance on the inversion class, which goes from no
correlation to -0.25.</p>
          <p>There is an overall drop in performance, with an increased number of results that are not statistically
significant. Comparing the different approaches, the pattern is the same as at the other levels of
analysis: standard deviation is the worst, as it is almost never significant, and perplexity performs the
best, especially since it is the only methodology where all three classes are statistically significant.
Moreover, with (pseudo)perplexity all models (except for GePpeTto) always have statistically significant
results, while with mean inter-sentence distance only about half of the results of the unaltered and the
substitution class are statistically significant.</p>
          <p>The highest correlation scores are obtained by Minerva with perplexity (-0.18 unaltered, -0.32
inversion, -0.26 substitution), mBERT with pseudo-perplexity (-0.21, -0.28 and -0.25 respectively),
and sBERT with mean inter-sentence cosine distance (-0.28, -0.07, -0.34).</p>
          <p>Results are in line with what we observed on the overall dataset and considering sources separately;
sub
sentence encoders, in particular, are the only models that manage to have two statistically significant
classes with distance-based approaches. There is only a slight diference in what concerns the impact
of language. When directly comparing mBERT and BERT-ita and BERT-ita-xxl, the first has better
performances both with inter-sentence distance measures and pseudo-perplexity measures, and overall
multilingual models seem to be performing better (except for sBERT among sentence encoders).</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Conclusions</title>
      <p>We evaluated the coherence assessment abilities of 15 small Italian language models, which varied in
their structural and training-related characteristics, using two unsupervised approaches: modeling
coherence based on inter-sentence semantic distance and perplexity. We evaluated results by their
correlation with human judgment of coherence and analysed our dataset at different levels, to monitor
differences related to the genre of the target text and the perturbation it was subjected to.</p>
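<p>Since the evaluation is rank-based, a minimal Spearman correlation can be written as follows (a sketch that ignores tie handling, which a library routine such as scipy.stats.spearmanr covers):</p>

```python
import numpy as np

def spearman(x, y):
    """Spearman rank correlation (no tie correction in this sketch)."""
    def rank(v):
        order = np.argsort(v)
        r = np.empty(len(v))
        r[order] = np.arange(1, len(v) + 1)
        return r
    rx, ry = rank(np.asarray(x)), rank(np.asarray(y))
    rx -= rx.mean()
    ry -= ry.mean()
    return float((rx * ry).sum() / np.sqrt((rx ** 2).sum() * (ry ** 2).sum()))
```

<p>A negative coefficient, as throughout our tables, means that higher perplexity or higher inter-sentence distance predicts lower human-rated coherence.</p>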
      <p>Perplexity and pseudo-perplexity consistently obtain the highest correlation with human judgments
and seem to be the most efective coherence assessment methods. When considering distance measures,
the accuracies obtained with sentence encoders were comparable to those of (pseudo)perplexity. Cosine
distance appeared to be slightly better than Euclidean distance, while sentence embeddings through
CLS and standard deviation of a paragraph’s inter-sentence distance proved to be unsuitable. With
perplexity and pseudo-perplexity the single most impactful decision seemed to be the model, regardless
of parameter size or architecture; conversely, architecture was the most influential factor with
intersentence distance approaches, with sentence encoders obtaining by far the best results. This was shown
not only by higher correlation coeficients but also by the very close range of values produced by the
models, underlining the reliability of the approach. Model and training set size did not seem to influence
much performance, while model language (multilingual or Italian) had contradictory results.</p>
      <p>Textual genre was shown to heavily influence model performance, both quantitatively and
qualitatively, with TEDx always obtaining much higher correlation coeficients than Wikipedia. It is unlikely
that they have been influenced by the presence of Wikipedia in the training, given both the lower
results and the evaluation against human judgments. These results could instead be influenced by the
wider value range in human judgments registered on the former, aiding a ranking-based correlation
measure, which underlines the relevance of considering genre in performance evaluation.</p>
      <p>The impact of diferent sources is also clear when the efect of perturbations is analyzed. Perturbation
classes not only exhibited markedly diferent behavior but also had diferent results depending on
the source of the paragraph. The clearest example is that of the inversion class, which performed
the best on TEDx, while on Wikipedia obtained good results with (pseudo)perplexity but was never
statistically significant with distance measures. Inversions impact order, which is more easily picked
up by a sequence-based metric like perplexity than by a semantically-rooted distance measure. Both
were enough to pick up alterations on TEDx, but only the former was efective on Wikipedia due to
its higher thematic coherence, highlighting the importance of considering the perturbations both in
isolation and in their interaction with other textual characteristics.
sub
sub</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <p>This paper is supported by the PRIN 2022 PNRR Project P20227PEPK (EKEEL - Empowering Knowledge
Extraction to Empower Learners), funded by the European Union – Next Generation EU, and the LuCET
- LingUistic Complexity Evaluation in educaTion - project under the PRIN grant no. 2022KPNY3B
funded by the Italian Ministry of University and Research.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>G.</given-names>
            <surname>Bonetta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. D.</given-names>
            <surname>Hromei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Siciliani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Stranisci</surname>
          </string-name>
,
<article-title>Preface to the Eighth Workshop on Natural Language for Artificial Intelligence (NL4AI 2024)</article-title>
,
<source>in: Proceedings of the Eighth Workshop on Natural Language for Artificial Intelligence (NL4AI 2024) co-located with the 23rd International Conference of the Italian Association for Artificial Intelligence (AI*IA 2024)</source>
,
<year>2024</year>
.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>G. L.</given-names>
            <surname>Beccaria</surname>
          </string-name>
          ,
<article-title>Dizionario di linguistica e di filologia, metrica, retorica</article-title>
, Einaudi,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>S.</given-names>
            <surname>Verberne</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Boves</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Oostdijk</surname>
          </string-name>
,
<string-name>
<given-names>P.-A.</given-names>
<surname>Coppen</surname>
</string-name>
,
          <article-title>Evaluating discourse-based answer extraction for why-question answering</article-title>
          ,
          <source>in: Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval</source>
          ,
          <year>2007</year>
          , pp.
          <fpage>735</fpage>
          -
          <lpage>736</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>B.</given-names>
            <surname>Elvevåg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. W.</given-names>
            <surname>Foltz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. R.</given-names>
            <surname>Weinberger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. E.</given-names>
            <surname>Goldberg</surname>
          </string-name>
          ,
          <article-title>Quantifying incoherence in speech: an automated methodology and novel application to schizophrenia</article-title>
          ,
          <source>Schizophrenia research 93</source>
          (
          <year>2007</year>
          )
          <fpage>304</fpage>
          -
          <lpage>316</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>D.</given-names>
            <surname>Iter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yoon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Jurafsky</surname>
          </string-name>
          ,
          <article-title>Automatic detection of incoherent speech for diagnosing schizophrenia</article-title>
          ,
          <source>in: Proceedings of the Fifth Workshop on Computational Linguistics and Clinical Psychology: From Keyboard to Clinic</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>136</fpage>
          -
          <lpage>146</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>P.</given-names>
            <surname>Muangkammuen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Fukumoto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. R.</given-names>
            <surname>Saikaew</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>A neural local coherence analysis model for clarity text scoring</article-title>
          ,
          <source>in: Proceedings of the 28th international conference on computational linguistics</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>2138</fpage>
          -
          <lpage>2143</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>S.</given-names>
            <surname>Gerani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Mehdad</surname>
          </string-name>
,
<string-name>
<given-names>G.</given-names>
<surname>Carenini</surname>
</string-name>
,
          <string-name>
            <given-names>R.</given-names>
            <surname>Ng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Nejat</surname>
          </string-name>
          ,
          <article-title>Abstractive summarization of product reviews using discourse structure</article-title>
          ,
          <source>in: Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP)</source>
          ,
          <year>2014</year>
          , pp.
          <fpage>1602</fpage>
          -
          <lpage>1613</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>A.</given-names>
            <surname>Lai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tetreault</surname>
          </string-name>
          ,
          <article-title>Discourse coherence in the wild: A dataset, evaluation and methods</article-title>
          ,
          <source>in: Proceedings of the 19th Annual SIGdial Meeting on Discourse and Dialogue</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>214</fpage>
          -
          <lpage>223</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>F. S.</given-names>
            <surname>Mim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Inoue</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Reisert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Ouchi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Inui</surname>
          </string-name>
          ,
          <article-title>Unsupervised learning of discourse-aware text representation for essay scoring</article-title>
          ,
          <source>in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>378</fpage>
          -
          <lpage>385</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>A.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mistica</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Salehi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Baldwin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Qi</surname>
          </string-name>
          ,
<article-title>Evaluating document coherence modeling</article-title>
,
<source>Transactions of the Association for Computational Linguistics</source>
<volume>9</volume>
(
<year>2021</year>
)
          <fpage>621</fpage>
          -
          <lpage>640</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>F.</given-names>
            <surname>Papa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Dini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Brunato</surname>
          </string-name>
          ,
<string-name>
<given-names>F.</given-names>
<surname>Dell'Orletta</surname>
</string-name>
,
<article-title>Unraveling text coherence from the human perspective: a novel dataset for Italian</article-title>
          ,
          <source>in: Proceedings of the Ninth Italian Conference on Computational Linguistics</source>
          (CLiC-it
          <year>2023</year>
          ),
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>A.</given-names>
            <surname>Beyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Loáiciga</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Schlangen</surname>
          </string-name>
          ,
          <article-title>Is incoherence surprising? targeted evaluation of coherence prediction from language models</article-title>
          ,
          <source>in: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>4164</fpage>
          -
          <lpage>4173</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. T.</given-names>
            <surname>Ng</surname>
          </string-name>
,
<string-name>
<given-names>M.-Y.</given-names>
<surname>Kan</surname>
</string-name>
          ,
          <article-title>Automatically evaluating text coherence using discourse relations</article-title>
          ,
          <source>in: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies</source>
          ,
          <year>2011</year>
          , pp.
          <fpage>997</fpage>
          -
          <lpage>1006</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>L.</given-names>
            <surname>Pishdad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Fancellu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Zhang</surname>
          </string-name>
,
<string-name>
<given-names>A.</given-names>
<surname>Fazly</surname>
</string-name>
,
          <article-title>How coherent are neural models of coherence?</article-title>
          ,
          <source>in: Proceedings of the 28th International Conference on Computational Linguistics</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>6126</fpage>
          -
          <lpage>6138</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>H. C.</given-names>
            <surname>Moon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. T.</given-names>
            <surname>Mohiuddin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Joty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <article-title>A unified neural coherence model</article-title>
          ,
          <source>in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>2262</fpage>
          -
          <lpage>2272</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>M. T.</given-names>
            <surname>Mohiuddin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Jwalapuram</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Joty</surname>
          </string-name>
          ,
          <article-title>Rethinking coherence modeling: Synthetic vs. downstream tasks</article-title>
          ,
          <source>in: Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics:</source>
          Main Volume,
          <year>2021</year>
          , pp.
          <fpage>3528</fpage>
          -
          <lpage>3539</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>R.</given-names>
            <surname>Barzilay</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lapata</surname>
          </string-name>
          ,
          <article-title>Modeling local coherence: An entity-based approach</article-title>
          ,
          <source>Computational Linguistics</source>
          <volume>34</volume>
          (
          <year>2008</year>
          )
          <fpage>1</fpage>
          -
          <lpage>34</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>P.</given-names>
            <surname>Laban</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Dai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Bandarkar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Hearst</surname>
          </string-name>
          ,
<article-title>Can transformer models measure coherence in text: Re-thinking the shuffle test</article-title>
          ,
          <source>in: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing</source>
(Volume
<volume>2</volume>
: Short Papers)
          ,
          <year>2021</year>
          , pp.
          <fpage>1058</fpage>
          -
          <lpage>1064</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>D.</given-names>
            <surname>Iter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Guu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Lansing</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Jurafsky</surname>
          </string-name>
          ,
          <article-title>Pretraining with contrastive sentence objectives improves discourse performance of language models</article-title>
          ,
          <source>in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>4859</fpage>
          -
          <lpage>4870</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>A.</given-names>
            <surname>Maimon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Tsarfaty</surname>
          </string-name>
          ,
          <article-title>A novel computational and modeling foundation for automatic coherence assessment</article-title>
          ,
          <source>arXiv preprint arXiv:2310.00598</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>P.</given-names>
            <surname>Huber</surname>
          </string-name>
,
<string-name>
<given-names>G.</given-names>
<surname>Carenini</surname>
</string-name>
,
          <article-title>Towards understanding large-scale discourse structures in pre-trained and fine-tuned language models</article-title>
          ,
          <source>in: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>2376</fpage>
          -
          <lpage>2394</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>S.</given-names>
            <surname>Duari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Bhatnagar</surname>
          </string-name>
          ,
<article-title>FFCD: A fast-and-frugal coherence detection method</article-title>
          ,
          <source>IEEE Access 10</source>
          (
          <year>2021</year>
          )
          <fpage>85305</fpage>
          -
          <lpage>85314</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>F.</given-names>
            <surname>Koto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. H.</given-names>
            <surname>Lau</surname>
          </string-name>
,
<string-name>
<given-names>T.</given-names>
<surname>Baldwin</surname>
</string-name>
,
          <article-title>Discourse probing of pretrained language models</article-title>
          ,
          <source>in: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>3849</fpage>
          -
          <lpage>3864</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>D.</given-names>
            <surname>Brunato</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Colla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Dell'Orletta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Dini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. P.</given-names>
            <surname>Radicioni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. A.</given-names>
            <surname>Ravelli</surname>
          </string-name>
,
<article-title>DisCoTex at EVALITA 2023: overview of the assessing discourse coherence in Italian texts task</article-title>
,
<source>in: Proceedings of the Eighth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2023)</source>
, CEUR.org, Parma, Italy,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>M.</given-names>
            <surname>Lai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Menini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Polignano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Russo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Sprugnoli</surname>
          </string-name>
,
<string-name>
<given-names>G.</given-names>
<surname>Venturi</surname>
</string-name>
,
<article-title>EVALITA 2023: Overview of the 8th evaluation campaign of natural language processing and speech tools for Italian</article-title>
,
<source>in: Proceedings of the Eighth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2023)</source>
, CEUR.org, Parma, Italy,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>F.</given-names>
            <surname>Feng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Cer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Arivazhagan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>Language-agnostic BERT sentence embedding</article-title>
          , in:
          <string-name>
            <given-names>S.</given-names>
            <surname>Muresan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Nakov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Villavicencio</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)</source>
          , Association for Computational Linguistics, Dublin, Ireland,
          <year>2022</year>
          , pp.
          <fpage>878</fpage>
          -
          <lpage>891</lpage>
          . URL: https://aclanthology.org/2022.acl-long.62. doi: 10.18653/v1/2022.acl-long.62.
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Cer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ahmad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Law</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Constant</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Hernandez Abrego</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Yuan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Tar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.-h.</given-names>
            <surname>Sung</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Strope</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Kurzweil</surname>
          </string-name>
          ,
          <article-title>Multilingual universal sentence encoder for semantic retrieval</article-title>
          , in:
          <string-name>
            <given-names>A.</given-names>
            <surname>Celikyilmaz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.-H.</given-names>
            <surname>Wen</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations</source>
          , Association for Computational Linguistics, Online,
          <year>2020</year>
          , pp.
          <fpage>87</fpage>
          -
          <lpage>94</lpage>
          . doi: 10.18653/v1/2020.acl-demos.12.
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>A.</given-names>
            <surname>Conneau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Khandelwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Chaudhary</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Wenzek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Guzmán</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Grave</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ott</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zettlemoyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Stoyanov</surname>
          </string-name>
          ,
          <article-title>Unsupervised cross-lingual representation learning at scale</article-title>
          , in:
          <string-name>
            <given-names>D.</given-names>
            <surname>Jurafsky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Schluter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tetreault</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics</source>
          , Association for Computational Linguistics, Online,
          <year>2020</year>
          , pp.
          <fpage>8440</fpage>
          -
          <lpage>8451</lpage>
          . doi: 10.18653/v1/2020.acl-main.747.
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>G.</given-names>
            <surname>Sarti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Nissim</surname>
          </string-name>
          ,
          <article-title>IT5: Large-scale text-to-text pretraining for Italian language understanding and generation</article-title>
          ,
          <source>arXiv preprint arXiv:2203.03759</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>W.</given-names>
            <surname>de Vries</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Nissim</surname>
          </string-name>
          ,
          <article-title>As good as new. how to successfully recycle English GPT-2 to make models for other languages</article-title>
          , in:
          <string-name>
            <given-names>C.</given-names>
            <surname>Zong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Xia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Navigli</surname>
          </string-name>
          (Eds.),
          <source>Findings of the Association for</source>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>