<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>IUSSNets at DisCoTex: A fine-tuned approach to coherence</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Emma Zanoli</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Matilde Barbini</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Cristiano Chesi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University School for Advanced Studies IUSS Pavia - NeTS Lab</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>We present our submission to the DisCoTex shared task of the EVALITA 2023 evaluation campaign, which focuses on modeling discourse coherence for Italian texts. We highlight the importance of coherence modeling in natural language processing tasks and briefly discuss related work, including earlier linguistic theories and recent neural models. To tackle the task, we leverage pre-trained Transformer models and fine-tune them on the provided datasets. Our approach incorporates monolingual models due to limited computing resources, but shows potential for multilingual and multitask learning. Our system ranks second overall, showing that Transformer models can be fruitfully leveraged for coherence assessment, but more work is needed to fully exploit their capabilities. The coherence assessment literature focuses primarily on English; this shared task and our work contribute to broadening the scope of current research.</p>
      </abstract>
      <kwd-group>
        <kwd>coherence</kwd>
        <kwd>Transformers</kwd>
        <kwd>NLP</kwd>
        <kwd>computational linguistics</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Written texts are often a sequence of semantically coherent segments, designed to create a smooth transition between various subtopics [1]. Modeling coherence can be done by building text analysis models that can distinguish a coherent text from incoherent ones, or that can output a coherence score [2]. It has been a key problem in discourse analysis, with applications in many downstream NLP tasks (e.g. text generation, summarization, machine translation, dialogue generation, etc.).</p>
      <p>Coherence modeling is at the heart of the DisCoTex shared task [3] of the EVALITA 2023 evaluation campaign [4]. This report relates the motivation and implementation of the IUSSnets team's submission.</p>
    </sec>
    <sec id="sec-1b">
      <title>2. Related work</title>
      <p>Early computational models for text coherence assessment were mainly based on one of two linguistic theories: a) centering theory [5] and b) rhetorical structure theory [6]. In line with the first, [7] and [8] use the distribution of entity transitions over sentences to predict text coherence. In line with the second, [9] and [10] produce discourse relations over sentences with a discourse parser, showing that the relations are indicative of text coherence.</p>
      <p>More recently, neural models have gained prominence in the task of coherence assessment. Popular examples are [11], [12], [13], and the recent state-of-the-art [14]. Our implementation choices are informed by [15], who are among the first to use Transformer models for coherence assessment.</p>
      <p>It is interesting to note that the literature on coherence finds significant overlap with the literature on readability. The two are often likened and used as general measures of textual quality [9]. Sometimes, coherence is used as an additional feature in readability assessment [12].</p>
      <p>By and large, the literature on automatic assessment of discourse coherence focuses on the English language. One notable exception is [16] for Danish.</p>
    </sec>
    <sec id="sec-1c">
      <title>3. Task</title>
      <p>DisCoTex is the first shared task focused on modelling discourse coherence for Italian real-world texts. The organizers proposed two sub-tasks:</p>
      <p>• Sub-task 1 - Last sentence classification: a binary classification task. Given a short paragraph (the prompt) and an individual sentence (the target), the goal is to classify whether the target follows the prompt or not, i.e. whether joining it to the prompt yields a coherent or an incoherent text.</p>
      <p>• Sub-task 2 - Human score prediction: a regression task. The goal is to predict the average coherence score assigned by human raters to short paragraphs. Judgments are expressed on a 5-point Likert scale.</p>
    </sec>
    <sec id="sec-1d">
      <title>4. DisCoTex Data</title>
      <p>The data were analyzed within text passages of four consecutive sentences. For task 1, these were split into 8000 prompt-target pairs for each domain: the prompt is always made of the first three consecutive sentences, whereas the target can either be the actual last sentence of the passage (for the positive class) or a different one (for the negative class). This dataset was automatically generated. For task 2 there were 1064 text passages, equally balanced across the two original source datasets, of which 50% were left unaltered and 50% were artificially modified to undermine coherence. This dataset was not automatically generated: each passage was annotated by at least 10 human evaluators who were native speakers of Italian.</p>
    </sec>
    <sec id="sec-2">
      <title>5. Description of the system</title>
      <sec id="sec-2a">
        <title>5.1. General intuition</title>
        <p>For this challenge we leveraged pre-trained Transformer models and fine-tuned them on the provided data.</p>
        <p>Transformer models [17] have been applied with tremendous success to the field of NLP. They have been shown to capture semantic relationships to a reasonable extent. As reported in Section 2, they have already been successfully applied to the task of discourse coherence modeling.</p>
        <p>Since the DisCoTex task is tailored specifically to the Italian language, we decided to leverage monolingual Transformer models that had been pre-trained exclusively on Italian data. Given that coherence assessment datasets are available for English, we initially intended to experiment with multilingual transfer learning, using multilingual pre-trained Transformer models and fine-tuning them simultaneously on English and Italian data. Unfortunately, our limited computing resources did not allow us to get this far within the time frame of the shared task. Preliminary results indicate that this would have been a promising approach.</p>
      </sec>
      <sec id="sec-2b">
        <title>5.2. Pre-trained models</title>
        <p>We experimented with 4 monolingual pre-trained models, freely available on the HuggingFace hub [18] at the time of writing:</p>
        <p>• bert-ita1: an Italian version of BERT [19];
• electra-ita2: an Italian version of ELECTRA [19];
• umberto3: an Italian version of RoBERTa [20];
• bertino4: an Italian version of DistilBERT [21].</p>
        <p>1https://huggingface.co/dbmdz/bert-base-italian-xxl-cased
2https://huggingface.co/dbmdz/electra-base-italian-xxl-cased-discriminator
3https://huggingface.co/Musixmatch/umberto-commoncrawl-cased-v1
4https://huggingface.co/indigo-ai/BERTino</p>
        <p>In the following we provide an overview of the main intuition for each model.</p>
        <p>BERT by Google [22] introduced “masked language modeling” (MLM): some of the input tokens were masked, and the pre-training objective was to predict the original vocabulary id of the masked word based only on its context. MLM enabled the representation to fuse the left and the right context, leading to a bidirectional Transformer. In addition to MLM, they also used a “next sentence prediction” task that jointly pre-trained text-pair representations. After pre-training, BERT could be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, without substantial task-specific architecture modifications.</p>
        <p>DistilBERT by HuggingFace [23] leveraged knowledge distillation during the pre-training phase, thus reducing the size of a BERT model by 40% while retaining 97% of its language understanding capabilities and being 60% faster. To leverage the inductive biases learned by larger models during pre-training, they introduced a triple loss combining language modeling, distillation, and cosine-distance losses.</p>
        <p>RoBERTa by Facebook AI [24] applied various pre-training enhancements to the original BERT model: longer training on longer sequences, bigger batches over more data, no next sentence prediction objective, and dynamically changing the masking pattern applied to the training data. These modifications advanced the state of the art on different downstream tasks.</p>
        <p>ELECTRA by Stanford and Google [25] introduced a new pre-training task called "replaced token detection": instead of masking the input, they corrupted it by replacing some tokens with plausible alternatives sampled from a small generator network. Then, instead of predicting the original identities of the corrupted tokens, they trained a discriminative model that predicts whether each token in the corrupted input was replaced by a generator sample or not. The model showed competitive performance compared to other models, while requiring fewer resources for training.</p>
        <p>As previously stated, we experimented with monolingual Italian versions of these models, i.e. models that were trained using the same approaches as the ones described above, but solely on Italian data. These models were used to encode the input and return a vector representation from the last layer output (i.e. the [CLS] token, which was taken to signify a vector representation of the sentence).</p>
      </sec>
      <sec id="sec-2c">
        <title>5.3. Fine-tuning</title>
        <p>The pre-trained models were fine-tuned on the available data for 10 epochs, using the following hyper-parameters: 0.1 dropout rate, 0.01 weight decay, 1e-6 learning rate, a batch size of 1, and no gradient clipping.</p>
      </sec>
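<p>For concreteness, the prompt-target format described in Section 4 can be sketched as follows. This is a toy illustration with invented two-passage mini-data; the function name is ours, and the organizers' actual negative-sampling procedure may differ.</p>

```python
import random

def make_pairs(passages, seed=0):
    """Build (prompt, target, label) pairs from 4-sentence passages:
    label 1 = the passage's true last sentence,
    label 0 = a last sentence taken from a different passage."""
    rng = random.Random(seed)
    pairs = []
    for i, passage in enumerate(passages):
        prompt = " ".join(passage[:3])
        pairs.append((prompt, passage[3], 1))  # positive: true continuation
        j = rng.choice([k for k in range(len(passages)) if k != i])
        pairs.append((prompt, passages[j][3], 0))  # negative: foreign sentence
    return pairs

# invented mini-passages standing in for the four-sentence DisCoTex passages
passages = [
    ["A1.", "A2.", "A3.", "A4."],
    ["B1.", "B2.", "B3.", "B4."],
]
pairs = make_pairs(passages)
```

<p>Each passage thus yields one coherent and one incoherent prompt-target pair, mirroring the balanced positive/negative classes of the sub-task 1 dataset.</p>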
      <sec id="sec-2-1">
        <title>Training data</title>
        <p>bert-ita: Wikipedia, OPUS [27], OSCAR [28]
electra-ita: Wikipedia, OPUS [27], OSCAR [28]
umberto: OSCAR [28] - deduplicated
bertino: PAISÀ [29], ItWaC [30]</p>
        <sec id="sec-2-1-1">
          <p>During fine-tuning, we only relied on the provided datasets. However, we used Transformer models which had been pre-trained on a variety of data sources (see Table 1).</p>
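<p>The data augmentation described below for sub-task 2 — expanding each mean rating into 10 pseudo-ratings that match a provided mean and standard deviation — can be sketched as follows. The report does not specify the sampling distribution, so this sketch standardizes a Gaussian base sample and rescales it so the population mean and standard deviation match exactly; the function name is ours.</p>

```python
import random
import statistics

def expand_score(mean, std, n=10, seed=0):
    """Generate n pseudo-ratings whose population mean and standard
    deviation exactly match the provided values."""
    rng = random.Random(seed)
    base = [rng.gauss(0.0, 1.0) for _ in range(n)]
    mu = statistics.fmean(base)
    sigma = statistics.pstdev(base)
    # standardize the base sample, then rescale to the target moments
    return [mean + std * (b - mu) / sigma for b in base]

# e.g. a passage with mean rating 3.4 and standard deviation 0.8
scores = expand_score(3.4, 0.8)
```

<p>Note that the generated scores are not clipped to the 1-5 Likert range, which would distort the target moments; whether the original experiments clipped is not stated in the report.</p>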
          <p>For sub-task 2 we attempted some data augmentation techniques. Since we had a dataset where each sentence had a mean score based on at least 10 judgments, we leveraged the standard deviation to generate a distribution of 10 scores that would have the provided mean and standard deviation. We thus ended up with 10 scores for each sentence, instead of a single average score. However, upon training our models on this augmented dataset, we did not notice any significant improvements and, because this approach was more resource-intensive, we eventually dropped it.</p>
          <p>Please note that we only made use of 80% of the provided datasets during fine-tuning; the remaining 20% was used as a validation split (more details below).</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>6. Results</title>
      <p>For the purposes of the official rankings, our results are 0.72 on sub-task 1 and 0.63 on sub-task 2.</p>
      <p>For sub-task 1, the organizers considered the accuracy of the best run and computed the mean between the best results on the two datasets (Ted and Wiki). For sub-task 2, they first computed both Pearson and Spearman correlations, then applied the harmonic mean between the two measures. Participants were allowed to submit multiple runs.</p>
      <sec id="sec-3-1">
        <title>6.1. Sub-task 1 - evaluation results</title>
        <p>In the absence of a test or validation set, we sampled 20% of the original training sets for preliminary evaluation. This resulted in 1600 randomly sampled data points for each dataset. On these sub-sets, we calculated the binary accuracy as implemented in the torchmetrics Python library5. We report results in Table 4.</p>
      </sec>
      <sec id="sec-3-2">
        <title>6.2. Sub-task 2 - evaluation results</title>
        <p>In the absence of a test or validation set, we sampled 20% of the original training set for preliminary evaluation. This resulted in 172 randomly sampled data points. On this sub-set, we computed the Spearman correlation coefficient as implemented in the scipy Python library6. We report results in Table 5.</p>
      </sec>
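<p>The two validation metrics above can be reproduced in a few lines. Binary accuracy is written out by hand here (mirroring what torchmetrics' binary accuracy computes), while the Spearman coefficient comes from scipy, as in our setup; the data values below are invented.</p>

```python
from scipy.stats import spearmanr

def binary_accuracy(preds, targets):
    """Fraction of predictions that match the gold labels."""
    assert len(preds) == len(targets)
    return sum(int(p == t) for p, t in zip(preds, targets)) / len(preds)

# sub-task 1 style: predicted vs. gold follows/does-not-follow labels
acc = binary_accuracy([1, 0, 1, 1], [1, 0, 0, 1])  # 3 of 4 correct

# sub-task 2 style: predicted vs. human mean coherence ratings
rho, _ = spearmanr([3.2, 4.1, 2.0, 4.8], [3.0, 4.4, 2.5, 4.6])
```
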
      <sec id="sec-3-3">
        <title>Table 4: Sub-task 1 binary accuracy on the 20% validation split</title>
        <p>Model | Dataset | Accuracy
bert-ita | wiki | 0.749
electra-ita | wiki | 0.716
umberto | wiki | 0.595
bertino | wiki | 0.637
bert-ita | all | 0.723
electra-ita | all | 0.583
bert-ita | ted | 0.704
electra-ita | ted | 0.617</p>
        <p>5https://torchmetrics.readthedocs.io/en/stable/classification/accuracy.html#binaryaccuracy
6https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.spearmanr.html</p>
      </sec>
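<p>The official sub-task 2 score described in Section 6 (the harmonic mean of the Pearson and Spearman correlations) can be sketched as follows; the function name is ours.</p>

```python
from scipy.stats import pearsonr, spearmanr

def discotex_task2_score(preds, golds):
    """Harmonic mean of Pearson and Spearman correlations,
    as used for the official sub-task 2 ranking."""
    r = pearsonr(preds, golds)[0]
    rho = spearmanr(preds, golds)[0]
    return 2 * r * rho / (r + rho)

# invented example: perfectly linear predictions give a score of 1.0
score = discotex_task2_score([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0])
```
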
    </sec>
    <sec id="sec-4">
      <title>7. Discussion</title>
      <p>The DisCoTex shared task provided us with an excellent opportunity to reflect on the notion of discourse coherence and on the ways it may be assessed, whether automatically or not.</p>
      <p>As a preamble, let us note that datasets for coherence assessment that are automatically created by shuffling existing texts have been criticized, among others, by [31] and [32], and the models trained on them have been shown to perform weakly on downstream tasks [2]. Nonetheless, such datasets have remained common benchmarks.</p>
      <p>Discourse coherence is a complicated concept that is related to almost every aspect of discourse communication. In the linguistics literature, there is no all-embracing rule governing coherence analysis: different scholars have presented their insight into different aspects of discourse coherence [33]. When we read a text or listen to speech, we are inclined to infuse it with coherence by making our own inferences based on our understanding and perception. Coherence is therefore achieved not by using superficial markers such as linguistic or grammatical devices, but through psychological, cognitive, or pragmatic means. The comprehension of discourse and an appreciation for its coherence are driven by active inference, background knowledge, and a degree of imagination.</p>
      <p>It comes as no surprise, then, that the many facets of this uniquely human experience are hard to model computationally. In order to get a sense for this, we looked into the dataset collected for sub-task 2. Overall, the majority of the training dataset contained texts rated 3.0 or higher; in other words, the texts were perceived as mostly coherent. It would have been interesting to compare how the annotators rated original vs. artificially modified text passages. Although we did not have this information in the dataset, when comparing the datasets for sub-tasks 1 and 2, we found 19 passages from the dataset for sub-task 2 in the positive class of the ted dataset for sub-task 1: this means that these passages had not been modified from their original sources and were thus expected to be coherent. Of these 19 passages:
• none were unanimously rated as coherent, i.e. a mean score of 5 (0%);
• 4 received a mean score of 4 or above (21%);
• 10 received a rating between 3 and 4 (53%);
• 4 received a rating between 2 and 3 (21%);
• 1 even received a rating below 2 (5%).</p>
      <p>If we were to revert these scores back to a binary classification (with a halfway cutoff at 2.5), 5 of these passages would be considered incoherent. However, for the purposes of sub-task 1, they would have been considered coherent. This simplistic example is in no way an exhaustive exploration of the nature of the tasks or the provided datasets, but it serves the purpose of reflecting on the difficulty of modeling these phenomena from a more explicit (linguistic or cognitive) perspective.</p>
      <p>Deep learning models generally, and Transformers specifically, have been shown to capture useful semantic information in texts. Previous work has investigated Transformers for their semantic [34] and even pragmatic [35] properties. For these reasons, we hypothesized that Transformer models would be a good fit for the task of coherence assessment. Indeed, even in our simple setup, we can see promising results. Further experimentation and greater computational power could lead to significant performance improvements. Multilingual and multi-task learning might prove particularly effective in boosting performance on Italian texts by leveraging datasets that exist for the English language or for other related tasks.</p>
    </sec>
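<p>The score reversion discussed in the Discussion — mapping a 1-5 mean coherence rating back to a sub-task-1-style binary label with a halfway cutoff at 2.5 — amounts to a one-line rule; the function name and example ratings are ours.</p>

```python
def binarize(mean_score, cutoff=2.5):
    """Map a 1-5 mean coherence rating to 1 (coherent) or 0 (incoherent)."""
    return int(mean_score >= cutoff)

# invented mean ratings for five passages
labels = [binarize(s) for s in [4.2, 3.1, 2.4, 1.8, 3.6]]
```
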
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <sec id="sec-5-1">
        <p>We are thankful to the bright research community at the NeTS lab of IUSS Pavia, who encouraged and supported these experiments. This research has been partially funded by the PON Governance 2014-2020: Next Generation UPP Project - CUP D19J22000240006.</p>
        <p>Moving forward, further exploration of linguistic theories and neural models can enhance discourse coherence assessment and facilitate more sophisticated language processing applications. Focusing on more controlled textual continuations (e.g. different logical conclusions from specific premises) would shed some light on the relevance of specific factors in coherence modeling. This would also allow us to better understand the strengths and weaknesses of a Transformer-based approach.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>References</title>
      <p>[14] A neural graph-based local coherence model, in: Findings of the Association for Computational Linguistics: EMNLP 2021, Association for Computational Linguistics, Punta Cana, Dominican Republic, 2021, pp. 2316-2321. URL: https://aclanthology.org/2021.findings-emnlp.199. doi:10.18653/v1/2021.findings-emnlp.199.</p>
      <p>[15] T. Abhishek, D. Rawat, M. Gupta, V. Varma, Transformer models for text coherence assessment, 2022. arXiv:2109.02176.</p>
      <p>[16] L. Flansmose Mikkelsen, O. Kinch, A. Jess Pedersen, O. Lacroix, DDisCo: A discourse coherence dataset for Danish, in: Proceedings of the Thirteenth Language Resources and Evaluation Conference, European Language Resources Association, Marseille, France, 2022, pp. 2440-2445. URL: https://aclanthology.org/2022.lrec-1.260.</p>
      <p>[17] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need, Advances in Neural Information Processing Systems 30 (2017).</p>
      <p>[18] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, J. Brew, HuggingFace's Transformers: State-of-the-art natural language processing, ArXiv abs/1910.03771 (2019).</p>
      <p>[19] S. Schweter, Italian BERT and ELECTRA models, 2020. URL: https://doi.org/10.5281/zenodo.4263142. doi:10.5281/zenodo.4263142.</p>
      <p>[20] L. Parisi, S. Francia, P. Magnani, UmBERTo: an Italian language model trained with whole word masking, https://github.com/musixmatchresearch/umberto, 2020.</p>
      <p>[21] M. Muffo, E. Bertino, BERTino: an Italian DistilBERT model, https://github.com/indigo-ai/BERTino, 2020.</p>
      <p>[22] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Association for Computational Linguistics, Minneapolis, Minnesota, 2019, pp. 4171-4186. URL: https://aclanthology.org/N19-1423. doi:10.18653/v1/N19-1423.</p>
      <p>[23] V. Sanh, L. Debut, J. Chaumond, T. Wolf, DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter, 2019. arXiv:1910.01108.</p>
      <p>[24] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov, RoBERTa: A robustly optimized BERT pretraining approach, 2019. arXiv:1907.11692.</p>
      <p>[25] K. Clark, M.-T. Luong, Q. V. Le, C. D. Manning, ELECTRA: Pre-training text encoders as discriminators rather than generators, in: ICLR, 2020. URL: https://openreview.net/pdf?id=r1xMH1BtvB.</p>
      <p>[26] D. P. Kingma, J. Ba, Adam: A method for stochastic optimization, 2017. arXiv:1412.6980.</p>
      <p>[27] J. Tiedemann, Parallel data, tools and interfaces in OPUS, in: N. Calzolari (Conference Chair), K. Choukri, T. Declerck, M. U. Dogan, B. Maegaard, J. Mariani, J. Odijk, S. Piperidis (Eds.), Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), European Language Resources Association (ELRA), Istanbul, Turkey, 2012.</p>
      <p>[28] P. J. Ortiz Suárez, B. Sagot, L. Romary, Asynchronous pipelines for processing huge corpora on medium to low resource infrastructures, in: Proceedings of the Workshop on Challenges in the Management of Large Corpora (CMLC-7), Cardiff, 22nd July 2019, Leibniz-Institut für Deutsche Sprache, Mannheim, 2019, pp. 9-16. URL: http://nbn-resolving.de/urn:nbn:de:bsz:mh39-90215. doi:10.14618/ids-pub-9021.</p>
      <p>[29] V. Lyding, E. Stemle, C. Borghetti, M. Brunello, S. Castagnoli, F. Dell'Orletta, H. Dittmann, A. Lenci, V. Pirrelli, The PAISÀ corpus of Italian web texts, in: Proceedings of the 9th Web as Corpus Workshop (WaC-9), Association for Computational Linguistics, Gothenburg, Sweden, 2014, pp. 36-43. URL: https://aclanthology.org/W14-0406. doi:10.3115/v1/W14-0406.</p>
      <p>[30] M. Baroni, S. Bernardini, A. Ferraresi, E. Zanchetta, The WaCky wide web: a collection of very large linguistically processed web-crawled corpora, Language Resources and Evaluation 43 (2009) 209-226.</p>
      <p>[31] P. Laban, L. Dai, L. Bandarkar, M. A. Hearst, Can transformer models measure coherence in text: Rethinking the shuffle test, in: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), Association for Computational Linguistics, Online, 2021, pp. 1058-1064. URL: https://aclanthology.org/2021.acl-short.134. doi:10.18653/v1/2021.acl-short.134.</p>
      <p>[32] A. Beyer, S. Loáiciga, D. Schlangen, Is incoherence surprising? Targeted evaluation of coherence prediction from language models, in: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics, Online, 2021, pp. 4164-4173. URL: https://aclanthology.org/2021.naacl-main.328. doi:10.18653/v1/2021.naacl-main.328.</p>
      <p>[33] Y. Wang, M. Guo, A short analysis of discourse coherence, Journal of Language Teaching and Research 5 (2014) 460.</p>
      <p>[34] E. Reif, A. Yuan, M. Wattenberg, F. B. Viegas, A. Coenen, A. Pearce, B. Kim, Visualizing and measuring the geometry of BERT, Advances in Neural Information Processing Systems 32 (2019).</p>
      <p>[35] L. Pandia, Y. Cong, A. Ettinger, Pragmatic competence of pre-trained language models through the lens of discourse connectives, in: Proceedings of the 25th Conference on Computational Natural Language Learning, Association for Computational Linguistics, Online, 2021, pp. 367-379. URL: https://aclanthology.org/2021.conll-1.29. doi:10.18653/v1/2021.conll-1.29.</p>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>