<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>IUSSNets at DisCoTex: A fine-tuned approach to coherence</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Emma Zanoli</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Matilde Barbini</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Cristiano Chesi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University School for Advanced Studies IUSS Pavia - NeTS Lab</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>We present our submission to the DisCoTex shared task of the EVALITA 2023 evaluation campaign, which focuses on modeling discourse coherence for Italian texts. We highlight the importance of coherence modeling in natural language processing tasks and briefly discuss related work, including earlier linguistic theories and recent neural models. To tackle the task, we leverage pre-trained Transformer models and fine-tune them on the provided datasets. Our approach incorporates monolingual models due to limited computing resources, but shows potential for multilingual and multitask learning. Our system ranks second overall, showing that Transformer models can be fruitfully leveraged for coherence assessment, but more work is needed to fully exploit their capabilities. The coherence assessment literature focuses primarily on English; this shared task and our work contribute to broadening the scope of current research.</p>
      </abstract>
      <kwd-group>
        <kwd>coherence</kwd>
        <kwd>Transformers</kwd>
        <kwd>NLP</kwd>
        <kwd>computational linguistics</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Written texts are often a sequence of semantically coherent segments, designed to create a smooth transition between various subtopics [1]. Modeling coherence can be done by building text analysis models that can distinguish a coherent text from incoherent ones, or that can output a coherence score [2]. It has been a key problem in discourse analysis, with applications in many downstream NLP tasks (e.g. text generation, summarization, machine translation, dialogue generation, etc.).</p>
      <p>Coherence modeling is at the heart of the DisCoTex shared task [3] of the EVALITA 2023 evaluation campaign [4]. This report relates the motivation and implementation of the IUSSnets team's submission.</p>
    </sec>
    <sec id="sec-1b">
      <title>2. Related work</title>
      <p>Early computational models for text coherence assessment were mainly based on one of two linguistic theories: a) centering theory [5] and b) rhetorical structure theory [6]. In line with the first, [7] and [8] use the distribution of entity transitions over sentences to predict text coherence. In line with the second, [9] and [10] produce discourse relations over sentences with a discourse parser, showing that the relations are indicative of text coherence.</p>
      <p>More recently, neural models have gained prominence in the task of coherence assessment. Popular examples are [11], [12], [13], and the recent state-of-the-art [14]. Our implementation choices are informed by [15], who are among the first to use Transformer models for coherence assessment.</p>
      <p>It is interesting to note that the literature on coherence finds significant overlap with the literature on readability. The two are often likened and used as general measures of textual quality [9]. Sometimes, coherence is used as an additional feature in readability assessment [12].</p>
      <p>By and large, the literature on automatic assessment of discourse coherence focuses on the English language. One notable exception is [16] for Danish.</p>
    </sec>
    <sec id="sec-1c">
      <title>3. Task</title>
      <p>DisCoTex is the first shared task focused on modelling discourse coherence for Italian real-world texts. The organizers proposed two sub-tasks:</p>
      <p>• Sub-task 1 - Last sentence classification: a binary classification task. Given a short paragraph (the prompt) and an individual sentence (the target), the goal is to classify whether the target follows the prompt or not, i.e. whether joining it to the prompt yields a coherent or an incoherent text.</p>
      <p>• Sub-task 2 - Human score prediction: a regression task. The goal is to predict the average coherence score assigned by human raters to short paragraphs. Judgments are expressed on a 5-point Likert scale.</p>
    </sec>
    <sec id="sec-1d">
      <title>4. DisCoTex Data</title>
      <p>The data were analyzed within text passages of four consecutive sentences. For task 1, these were split into 8000 prompt-target pairs for each domain: the prompt is always made of the first three consecutive sentences, whereas the target can either be the actual last sentence of the passage (for the positive class) or a different one (for the negative class). This dataset was automatically generated. For task 2 there were 1064 text passages, equally balanced across the two original source datasets, of which 50% were left unaltered and 50% were artificially modified to undermine coherence. This dataset was not automatically generated: each passage was annotated by at least 10 human evaluators who were native speakers of Italian.</p>
    </sec>
    <sec id="sec-2">
      <title>5. Description of the system</title>
      <sec id="sec-2a">
        <title>5.1. General intuition</title>
        <p>For this challenge we leveraged pre-trained Transformer models and fine-tuned them on the provided data.</p>
        <p>Transformer models [17] have been applied with tremendous success to the field of NLP. They have been shown to capture semantic relationships to a reasonable extent. As reported in Section 2, they have already been successfully applied to the task of discourse coherence modeling.</p>
        <p>Since the DisCoTex task is tailored specifically to the Italian language, we decided to leverage monolingual Transformer models that had been pre-trained exclusively on Italian data. Given that coherence assessment datasets are available for English, we initially intended to experiment with multilingual transfer learning, using multilingual pre-trained Transformer models and fine-tuning them simultaneously on English and Italian data. Unfortunately, our limited computing resources did not allow us to get this far within the time frame of the shared task. Preliminary results indicate that this would have been a promising approach.</p>
      </sec>
      <sec id="sec-2b">
        <title>5.2. Pre-trained models</title>
        <p>We experimented with 4 monolingual pre-trained models, freely available on the HuggingFace hub [18] at the time of writing:</p>
        <p>• bert-ita1: an Italian version of BERT [19];
• electra-ita2: an Italian version of ELECTRA [19];
• umberto3: an Italian version of RoBERTa [20];
• bertino4: an Italian version of DistilBERT [21].</p>
        <p>1https://huggingface.co/dbmdz/bert-base-italian-xxl-cased
2https://huggingface.co/dbmdz/electra-base-italian-xxl-cased-discriminator
3https://huggingface.co/Musixmatch/umberto-commoncrawl-cased-v1
4https://huggingface.co/indigo-ai/BERTino</p>
        <p>In the following we provide an overview of the main intuition for each model.</p>
        <p>BERT by Google [22] introduced “masked language modeling” (MLM): some of the input tokens were masked, and the pre-training objective was to predict the original vocabulary id of the masked word based only on its context. MLM enabled the representation to fuse the left and the right context, leading to a bidirectional Transformer. In addition to MLM, they also used a “next sentence prediction” task that jointly pre-trained text-pair representations. After pre-training, BERT could be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, without substantial task-specific architecture modifications.</p>
        <p>DistilBERT by HuggingFace [23] leveraged knowledge distillation during the pre-training phase, thus reducing the size of a BERT model by 40% while retaining 97% of its language understanding capabilities and being 60% faster. To leverage the inductive biases learned by larger models during pre-training, they introduced a triple loss combining language modeling, distillation, and cosine-distance losses.</p>
        <p>RoBERTa by Facebook AI [24] applied various pre-training enhancements to the original BERT model: longer training on longer sequences, bigger batches over more data, no next sentence prediction objective, and dynamically changing the masking pattern applied to the training data. These modifications advanced the state of the art on different downstream tasks.</p>
        <p>ELECTRA by Stanford and Google [25] introduced a new pre-training task called "replaced token detection": instead of masking the input, they corrupted it by replacing some tokens with plausible alternatives sampled from a small generator network. Then, instead of predicting the original identities of the corrupted tokens, they trained a discriminative model that predicts whether each token in the corrupted input was replaced by a generator sample or not. The model showed competitive performance compared to other models, while requiring fewer resources for training.</p>
        <p>As previously stated, we experimented with monolingual Italian versions of these models, i.e. models that were trained using the same approaches as the ones described above, but solely on Italian data. These models were used to encode the input and return a vector representation from the last layer output (i.e. the [CLS] token, which was taken to signify a vector representation of the sentence).</p>
      </sec>
      <sec id="sec-2c">
        <title>5.3. Fine-tuning</title>
        <p>The pre-trained models were fine-tuned on the available data for 10 epochs, using the following hyper-parameters: 0.1 dropout rate, 0.01 weight decay, 1e-6 learning rate, a batch size of 1, and no gradient clipping.</p>
      </sec>
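<p>For concreteness, the prompt-target format described in Section 4 can be sketched as follows. This is a toy illustration with invented two-passage mini-data; the function name is ours, and the organizers' actual negative-sampling procedure may differ.</p>

```python
import random

def make_pairs(passages, seed=0):
    """Build (prompt, target, label) pairs from 4-sentence passages:
    label 1 = the passage's true last sentence,
    label 0 = a last sentence taken from a different passage."""
    rng = random.Random(seed)
    pairs = []
    for i, passage in enumerate(passages):
        prompt = " ".join(passage[:3])
        pairs.append((prompt, passage[3], 1))  # positive: true continuation
        j = rng.choice([k for k in range(len(passages)) if k != i])
        pairs.append((prompt, passages[j][3], 0))  # negative: foreign sentence
    return pairs

# invented mini-passages standing in for the four-sentence DisCoTex passages
passages = [
    ["A1.", "A2.", "A3.", "A4."],
    ["B1.", "B2.", "B3.", "B4."],
]
pairs = make_pairs(passages)
```

<p>Each passage thus yields one coherent and one incoherent prompt-target pair, mirroring the balanced positive/negative classes of the sub-task 1 dataset.</p>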
      <sec id="sec-2-1">
        <title>Training data</title>
        <p>bert-ita: Wikipedia, OPUS [27], OSCAR [28]
electra-ita: Wikipedia, OPUS [27], OSCAR [28]
umberto: OSCAR [28] - deduplicated
bertino: PAISÀ [29], ItWaC [30]</p>
        <sec id="sec-2-1-1">
          <p>During fine-tuning, we only relied on the provided datasets. However, we used Transformer models which had been pre-trained on a variety of data sources (see Table 1).</p>
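<p>The data augmentation described below for sub-task 2 — expanding each mean rating into 10 pseudo-ratings that match a provided mean and standard deviation — can be sketched as follows. The report does not specify the sampling distribution, so this sketch standardizes a Gaussian base sample and rescales it so the population mean and standard deviation match exactly; the function name is ours.</p>

```python
import random
import statistics

def expand_score(mean, std, n=10, seed=0):
    """Generate n pseudo-ratings whose population mean and standard
    deviation exactly match the provided values."""
    rng = random.Random(seed)
    base = [rng.gauss(0.0, 1.0) for _ in range(n)]
    mu = statistics.fmean(base)
    sigma = statistics.pstdev(base)
    # standardize the base sample, then rescale to the target moments
    return [mean + std * (b - mu) / sigma for b in base]

# e.g. a passage with mean rating 3.4 and standard deviation 0.8
scores = expand_score(3.4, 0.8)
```

<p>Note that the generated scores are not clipped to the 1-5 Likert range, which would distort the target moments; whether the original experiments clipped is not stated in the report.</p>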
          <p>For sub-task 2 we attempted some data augmentation techniques. Since we had a dataset where each sentence had a mean score based on at least 10 judgments, we leveraged the standard deviation to generate a distribution of 10 scores that would have the provided mean and standard deviation. We thus ended up with 10 scores for each sentence, instead of a single average score. However, upon training our models on this augmented dataset, we did not notice any significant improvements and, because this approach was more resource-intensive, we eventually dropped it.</p>
          <p>Please note that we only made use of 80% of the provided datasets during fine-tuning; the remaining 20% was used as a validation split (more details below).</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>6. Results</title>
      <p>For the purposes of the official rankings, our results are 0.72 on sub-task 1 and 0.63 on sub-task 2.</p>
      <p>For sub-task 1, the organizers considered the accuracy of the best run and computed the mean between the best results on the two datasets (Ted and Wiki). For sub-task 2, they first computed both Pearson and Spearman correlations, then applied the harmonic mean between the two measures. Participants were allowed to submit multiple runs.</p>
      <sec id="sec-3-1">
        <title>6.1. Sub-task 1 - evaluation results</title>
        <p>In the absence of a test or validation set, we sampled 20% of the original training sets for preliminary evaluation. This resulted in 1600 randomly sampled data points for each dataset. On these sub-sets, we calculated the binary accuracy as implemented in the torchmetrics Python library5. We report results in Table 4.</p>
      </sec>
      <sec id="sec-3-2">
        <title>6.2. Sub-task 2 - evaluation results</title>
        <p>In the absence of a test or validation set, we sampled 20% of the original training set for preliminary evaluation. This resulted in 172 randomly sampled data points. On this sub-set, we computed the Spearman correlation coefficient as implemented in the scipy Python library6. We report results in Table 5.</p>
      </sec>
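<p>The two validation metrics above can be reproduced in a few lines. Binary accuracy is written out by hand here (mirroring what torchmetrics' binary accuracy computes), while the Spearman coefficient comes from scipy, as in our setup; the data values below are invented.</p>

```python
from scipy.stats import spearmanr

def binary_accuracy(preds, targets):
    """Fraction of predictions that match the gold labels."""
    assert len(preds) == len(targets)
    return sum(int(p == t) for p, t in zip(preds, targets)) / len(preds)

# sub-task 1 style: predicted vs. gold follows/does-not-follow labels
acc = binary_accuracy([1, 0, 1, 1], [1, 0, 0, 1])  # 3 of 4 correct

# sub-task 2 style: predicted vs. human mean coherence ratings
rho, _ = spearmanr([3.2, 4.1, 2.0, 4.8], [3.0, 4.4, 2.5, 4.6])
```
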
      <sec id="sec-3-3">
        <title>Table 4: Sub-task 1 binary accuracy on the 20% validation split</title>
        <p>Model | Dataset | Accuracy
bert-ita | wiki | 0.749
electra-ita | wiki | 0.716
umberto | wiki | 0.595
bertino | wiki | 0.637
bert-ita | all | 0.723
electra-ita | all | 0.583
bert-ita | ted | 0.704
electra-ita | ted | 0.617</p>
        <p>5https://torchmetrics.readthedocs.io/en/stable/classification/accuracy.html#binaryaccuracy
6https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.spearmanr.html</p>
      </sec>
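<p>The official sub-task 2 score described in Section 6 (the harmonic mean of the Pearson and Spearman correlations) can be sketched as follows; the function name is ours.</p>

```python
from scipy.stats import pearsonr, spearmanr

def discotex_task2_score(preds, golds):
    """Harmonic mean of Pearson and Spearman correlations,
    as used for the official sub-task 2 ranking."""
    r = pearsonr(preds, golds)[0]
    rho = spearmanr(preds, golds)[0]
    return 2 * r * rho / (r + rho)

# invented example: perfectly linear predictions give a score of 1.0
score = discotex_task2_score([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0])
```
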
    </sec>
    <sec id="sec-4">
      <title>7. Discussion</title>
      <p>The DisCoTex shared task provided us with an excellent opportunity to reflect on the notion of discourse coherence and on the ways it may be assessed, whether automatically or not.</p>
      <p>As a preamble, let us note that datasets for coherence assessment that are automatically created by shuffling existing texts have been criticized, among others, by [31] and [32], and the models trained on them have been shown to perform weakly on downstream tasks [2]. Nonetheless, such datasets have remained common benchmarks.</p>
      <p>Discourse coherence is a complicated concept that is related to almost every aspect of discourse communication. In the linguistics literature, there is no all-embracing rule governing coherence analysis: different scholars have presented their insight into different aspects of discourse coherence [33]. When we read a text or listen to speech, we are inclined to infuse it with coherence by making our own inferences based on our understanding and perception. Coherence is therefore achieved not by using superficial markers such as linguistic or grammatical devices, but through psychological, cognitive, or pragmatic means. The comprehension of discourse and an appreciation for its coherence are driven by active inference, background knowledge, and a degree of imagination.</p>
      <p>It comes as no surprise, then, that the many facets of this uniquely human experience are hard to model computationally. In order to get a sense for this, we looked into the dataset collected for sub-task 2. Overall, the majority of the training dataset contained texts rated 3.0 or higher; in other words, the texts were perceived as mostly coherent. It would have been interesting to compare how the annotators rated original vs. artificially modified text passages. Although we did not have this information in the dataset, when comparing the datasets for sub-tasks 1 and 2, we found 19 passages from the dataset for sub-task 2 in the positive class of the ted dataset for sub-task 1: this means that these passages had not been modified from their original sources and were thus expected to be coherent. Of these 19 passages:
• none were unanimously rated as coherent, i.e. a mean score of 5 (0%);
• 4 received a mean score of 4 or above (21%);
• 10 received a rating between 3 and 4 (53%);
• 4 received a rating between 2 and 3 (21%);
• 1 even received a rating below 2 (5%).</p>
      <p>If we were to revert these scores back to a binary classification (with a halfway cutoff at 2.5), 5 of these passages would be considered incoherent. However, for the purposes of sub-task 1, they would have been considered coherent. This simplistic example is in no way an exhaustive exploration of the nature of the tasks or the provided datasets, but it serves the purpose of reflecting on the difficulty of modeling these phenomena from a more explicit (linguistic or cognitive) perspective.</p>
      <p>Deep learning models generally, and Transformers specifically, have been shown to capture useful semantic information in texts. Previous work has investigated Transformers for their semantic [34] and even pragmatic [35] properties. For these reasons, we hypothesized that Transformer models would be a good fit for the task of coherence assessment. Indeed, even in our simple setup, we can see promising results. Further experimentation and greater computational power could lead to significant performance improvements. Multilingual and multi-task learning might prove particularly effective in boosting performance on Italian texts by leveraging datasets that exist for the English language or for other related tasks.</p>
    </sec>
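<p>The score reversion discussed in the Discussion — mapping a 1-5 mean coherence rating back to a sub-task-1-style binary label with a halfway cutoff at 2.5 — amounts to a one-line rule; the function name and example ratings are ours.</p>

```python
def binarize(mean_score, cutoff=2.5):
    """Map a 1-5 mean coherence rating to 1 (coherent) or 0 (incoherent)."""
    return int(mean_score >= cutoff)

# invented mean ratings for five passages
labels = [binarize(s) for s in [4.2, 3.1, 2.4, 1.8, 3.6]]
```
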
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <sec id="sec-5-1">
        <p>We are thankful to the bright research community at the NeTS lab of IUSS Pavia, who encouraged and supported these experiments. This research has been partially funded by the PON Governance 2014-2020: Next Generation UPP Project - CUP D19J22000240006.</p>
        <p>Moving forward, further exploration of linguistic theories and neural models can enhance discourse coherence assessment and facilitate more sophisticated language processing applications. Focusing on more controlled textual continuations (e.g. different logical conclusions from specific premises) would shed some light on the relevance of specific factors in coherence modeling. This would also allow us to better understand the strengths and weaknesses of a Transformer-based approach.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>References</title>
      <p>[14] A neural graph-based local coherence model, in: Findings of the Association for Computational Linguistics: EMNLP 2021, Association for Computational Linguistics, Punta Cana, Dominican Republic, 2021, pp. 2316-2321. URL: https://aclanthology.org/2021.findings-emnlp.199. doi:10.18653/v1/2021.findings-emnlp.199.</p>
      <p>[15] T. Abhishek, D. Rawat, M. Gupta, V. Varma, Transformer models for text coherence assessment, 2022. arXiv:2109.02176.</p>
      <p>[16] L. Flansmose Mikkelsen, O. Kinch, A. Jess Pedersen, O. Lacroix, DDisCo: A discourse coherence dataset for Danish, in: Proceedings of the Thirteenth Language Resources and Evaluation Conference, European Language Resources Association, Marseille, France, 2022, pp. 2440-2445. URL: https://aclanthology.org/2022.lrec-1.260.</p>
      <p>[17] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need, Advances in Neural Information Processing Systems 30 (2017).</p>
      <p>[18] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, J. Brew, HuggingFace's Transformers: State-of-the-art natural language processing, ArXiv abs/1910.03771 (2019).</p>
      <p>[19] S. Schweter, Italian BERT and ELECTRA models, 2020. URL: https://doi.org/10.5281/zenodo.4263142. doi:10.5281/zenodo.4263142.</p>
      <p>[20] L. Parisi, S. Francia, P. Magnani, UmBERTo: an Italian language model trained with whole word masking, https://github.com/musixmatchresearch/umberto, 2020.</p>
      <p>[21] M. Muffo, E. Bertino, BERTino: an Italian DistilBERT model, https://github.com/indigo-ai/BERTino, 2020.</p>
      <p>[22] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Association for Computational Linguistics, Minneapolis, Minnesota, 2019, pp. 4171-4186. URL: https://aclanthology.org/N19-1423. doi:10.18653/v1/N19-1423.</p>
      <p>[23] V. Sanh, L. Debut, J. Chaumond, T. Wolf, DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter, 2019. arXiv:1910.01108.</p>
      <p>[24] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov, RoBERTa: A robustly optimized BERT pretraining approach, 2019. arXiv:1907.11692.</p>
      <p>[25] K. Clark, M.-T. Luong, Q. V. Le, C. D. Manning, ELECTRA: Pre-training text encoders as discriminators rather than generators, in: ICLR, 2020. URL: https://openreview.net/pdf?id=r1xMH1BtvB.</p>
      <p>[26] D. P. Kingma, J. Ba, Adam: A method for stochastic optimization, 2017. arXiv:1412.6980.</p>
      <p>[27] J. Tiedemann, Parallel data, tools and interfaces in OPUS, in: N. Calzolari (Conference Chair), K. Choukri, T. Declerck, M. U. Dogan, B. Maegaard, J. Mariani, J. Odijk, S. Piperidis (Eds.), Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), European Language Resources Association (ELRA), Istanbul, Turkey, 2012.</p>
      <p>[28] P. J. Ortiz Suárez, B. Sagot, L. Romary, Asynchronous pipelines for processing huge corpora on medium to low resource infrastructures, in: Proceedings of the Workshop on Challenges in the Management of Large Corpora (CMLC-7), Cardiff, 22nd July 2019, Leibniz-Institut für Deutsche Sprache, Mannheim, 2019, pp. 9-16. URL: http://nbn-resolving.de/urn:nbn:de:bsz:mh39-90215. doi:10.14618/ids-pub-9021.</p>
      <p>[29] V. Lyding, E. Stemle, C. Borghetti, M. Brunello, S. Castagnoli, F. Dell'Orletta, H. Dittmann, A. Lenci, V. Pirrelli, The PAISÀ corpus of Italian web texts, in: Proceedings of the 9th Web as Corpus Workshop (WaC-9), Association for Computational Linguistics, Gothenburg, Sweden, 2014, pp. 36-43. URL: https://aclanthology.org/W14-0406. doi:10.3115/v1/W14-0406.</p>
      <p>[30] M. Baroni, S. Bernardini, A. Ferraresi, E. Zanchetta, The WaCky wide web: a collection of very large linguistically processed web-crawled corpora, Language Resources and Evaluation 43 (2009) 209-226.</p>
      <p>[31] P. Laban, L. Dai, L. Bandarkar, M. A. Hearst, Can transformer models measure coherence in text: Rethinking the shuffle test, in: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), Association for Computational Linguistics, Online, 2021, pp. 1058-1064. URL: https://aclanthology.org/2021.acl-short.134. doi:10.18653/v1/2021.acl-short.134.</p>
      <p>[32] A. Beyer, S. Loáiciga, D. Schlangen, Is incoherence surprising? Targeted evaluation of coherence prediction from language models, in: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics, Online, 2021, pp. 4164-4173. URL: https://aclanthology.org/2021.naacl-main.328. doi:10.18653/v1/2021.naacl-main.328.</p>
      <p>[33] Y. Wang, M. Guo, A short analysis of discourse coherence, Journal of Language Teaching and Research 5 (2014) 460.</p>
      <p>[34] E. Reif, A. Yuan, M. Wattenberg, F. B. Viegas, A. Coenen, A. Pearce, B. Kim, Visualizing and measuring the geometry of BERT, Advances in Neural Information Processing Systems 32 (2019).</p>
      <p>[35] L. Pandia, Y. Cong, A. Ettinger, Pragmatic competence of pre-trained language models through the lens of discourse connectives, in: Proceedings of the 25th Conference on Computational Natural Language Learning, Association for Computational Linguistics, Online, 2021, pp. 367-379. URL: https://aclanthology.org/2021.conll-1.29. doi:10.18653/v1/2021.conll-1.29.</p>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>