Towards an Automatic Evaluation of (In)coherence in Student Essays

Filippo Pellegrino1,*, Jennifer Carmen Frey1 and Lorenzo Zanasi1

1 Eurac Research Institute, Viale Druso/Drususallee 1, 39100 Bolzano, Autonome Provinz Bozen - Südtirol

Abstract
Coherence modeling is an important task in natural language processing (NLP) with potential impact on other NLP tasks such as Natural Language Understanding or Automated Essay Scoring. Automatic approaches to coherence modeling aim to distinguish coherent from incoherent (often synthetically created) texts or to identify the correct continuation for a given sample of texts, as demonstrated for Italian in the DisCoTex task of EVALITA 2023. While early work on coherence modelling focused on exploring definitions of the phenomenon, exploring the performance of neural models has dominated the field in recent years. However, coherence modelling can also offer interesting linguistic insights with pedagogical implications. In this article, we target coherence modeling for the Italian language in a strongly domain-specific scenario, i.e. education. We use a corpus of student essays collected to analyse students' text coherence, in combination with data perturbation techniques, to experiment with the effect of various linguistically informed features of incoherent writing on current coherence modelling strategies used in NLP. Our results show the capability of encoder models to capture features of (in)coherence in a domain-specific scenario, discerning natural from artificially corrupted texts.

Keywords
Coherence modelling, data perturbation, transformers, education, student essays

1. Introduction

Argumentative essay writing is a fundamental objective in education for both vocational schools and high schools in Italy, as indicated in [1, 2]. It requires students to present arguments supported by personal knowledge or external sources in a coherent and convincing manner. However, writing coherent texts poses both cognitive and linguistic challenges to novice writers, and the textual competences related to it are frequently claimed to be insufficient, putting pressure on the educational system. Automatically discerning incoherent texts or passages could help teachers to better understand students' problems and give targeted instructions, while students would benefit from more frequent and more timely feedback. However, to date, most NLP research in automatic coherence modelling has focused on semantic similarity between two parts of texts, using mostly well-formed newspaper or Wikipedia texts and thus offering little information for educational contexts.

In this study, we explore coherence from an educational perspective, utilizing recent language models and data perturbation techniques to probe their value for linguistically informed and informative automatic coherence evaluation for student essays. While large language models have been used successfully in domain-general coherence modelling before, we test their effectiveness for text analysis in this domain-specific scenario, taking into account both surface and non-standard language features. We discuss:

• data perturbation techniques to artificially reproduce real-life incoherence in textual data
• a custom probing task design
• automatic evaluation of coherence using different encoding models

The results of our experiments show the performance of encoder models in recognizing patterns of (in)coherence in a domain-specific educational context such as upper secondary school student essays. The paper is organized as follows: Section 2 provides an overview of previous approaches to coherence modelling and NLP data perturbation with a focus on Italian NLP. Section 3 introduces the data we used for this study, giving information on the research project it originates in as well as on the corpus design and annotation. Section 4 provides a detailed description of our methodology, introducing our custom probing tasks (Section 4.1), the models used (Section 4.2), and the text encoding (Section 4.3), as well as a description of the two analyses performed (Section 4.4 and Section 4.5). Sections 5 and 6 present and discuss our results, and Section 7 concludes the article with final considerations.

CLiC-it 2024: Tenth Italian Conference on Computational Linguistics, Dec 04–06, 2024, Pisa, Italy
* Corresponding author.
filippo.pellegrino.job@gmail.com (F. Pellegrino); jennifercarmen.frey@eurac.edu (J. C. Frey); lorenzo.zanasi@eurac.edu (L. Zanasi)
ORCID: 0000-0002-7008-6394 (J. C. Frey); 0000-0002-4439-6567 (L. Zanasi)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

2. Related Work

2.1. Coherence modelling

Coherence modeling is an important task in natural language processing (NLP) with potential impact on other NLP tasks such as Natural Language Understanding or automated essay scoring. Early work on coherence modelling focused on the definition of the phenomenon [3, 4, 5, 6, 7] and provides valuable frameworks such as Centering Theory [8, 9] and the Entity-Grid approach [10]. Following the great development of neural network systems in recent years, many works such as [11, 12, 13, 14] explored coherence modelling, implementing increasingly sophisticated solutions for the English language. Recently, the Italian NLP community has approached the topic from an engineering point of view, using Italian pre-trained neural models to distinguish coherent from (mainly synthetically constructed) non-coherent texts [15, 16, 17, 18]. Some efforts were also made for multilingual scenarios [19], demonstrating the encoding capabilities of multilingual models for coherence features.

2.2. Data perturbation

In data perturbation, dataset entries are corrupted with specific computational operations to simulate noise conditions and to test model performance under real-world conditions [20]. Many studies on data perturbation and data augmentation in NLP focus on model-agnostic methods [20, 21, 22, 23] using random deletion, random swap, synonym replacement, random insertion and punctuation insertion techniques for text classification with limited amounts of data. More sophisticated and task-oriented data augmentation approaches have been proposed for sentiment analysis [24], hate speech classification [25], hypernymy detection [26] and domain-specific classification [27].

3. Data

The data used in this study originates from a research project conducted in South Tyrol between 2020 and 2024. The project, named ITACA: Coerenza nell'ITAliano Accademico [28], had the aim of studying the textual competences of students in their first language, Italian, with particular focus on aspects of text coherence. Within the project, various outcomes have been produced: a corpus of Italian student essays collected in Italian South Tyrolean upper secondary schools, a validated rating scale to evaluate coherence in student essays, and coherence ratings for texts in the corpus from three independent raters using the previously developed rating scale. These products are described in the following sections.

3.1. ITACA Corpus

The ITACA corpus1 is an annotated learner corpus created within the project ITACA: Coerenza nell'ITAliano Accademico [28]. It consists of a total of 636 argumentative essays from Italian L1 upper secondary school students from the autonomous province of Bolzano/Bozen2, collected during the school year 2021/2022. The texts were collected by asking 12th grade students to type an argumentative essay following precise indications of writing time, text length and topic. The full assignment can be consulted in Appendix B. While the assignment asked for a minimum text length of 600 words, the average number of tokens per essay is 668, just slightly above the minimum length requirement. The totality of the 636 collected texts amounts to 382,964 tokens. All data were collected digitally and anonymously and underwent subsequent control and cleaning procedures, partly manual, to ensure their integrity and to guarantee the anonymity of the participants. Essays were collected by asking students to type them into an input field in an online form; additional metadata was collected through a subsequent online questionnaire asking for basic socio-demographic information, students' language background, and reading and writing habits.

The whole corpus was automatically tokenized, lemmatized and annotated for part-of-speech and syntactic dependencies with the support of project collaborators from Fondazione Bruno Kessler, who also supported the project in the setup of an interface for manual annotation based on Inception [29].

A manual annotation of a subset of 388 texts was performed by two trained annotators and offers detailed descriptions of the texts' structure, with a focus on the use of various linguistic features (such as punctuation, connectives, agreements, anaphora, contradictions) that enhance or limit the texts' cohesion and coherence. The manual annotation of the corpus was guided by the three sections elaborated in [30] and contained annotations for traits of incoherence referring to:

1. segmentation (e.g. splice comma, added comma, not-signed parenthetical clause)
2. the logic-argumentative plan (e.g. issues in the use of connectives, contradictions)
3. the thematic-referential plan (e.g. critical agreement, critical anaphora, not-expanded comment)

The corpus is accessible through an ANNIS search interface3 and can be downloaded in various formats from the Eurac Research Clarin Center (ERCC) under the CLARIN ACADEMIC END-USER LICENCE ACA-BY-NC-NORED 1.0 licence4. Downloads and further documentation can also be accessed via Eurac Research's PORTA platform5.

1 https://www.porta.eurac.edu/lci/itaca/
2 Texts were collected in Bolzano, Bressanone, Merano and Brunico.
3 https://commul.eurac.edu/annis/itaca
4 http://hdl.handle.net/20.500.12124/76
5 https://www.porta.eurac.edu/itaca

3.2. Manual coherence ratings

Each single essay was additionally manually evaluated in a double-blind manner by a panel of six experts who applied a specially created rating scale, subsequently validated, to assess textual coherence. The items were rated on a Likert scale from one to ten and referred to three dimensions of coherence (structure, comprehensibility, segmentation). The average structure score is attested at 𝜇 = 4.55 with standard deviation 𝜎 = 5. For comprehensibility, 𝜇 = 6.29 and 𝜎 = 1.65, while for segmentation 𝜇 = 5.99 and 𝜎 = 1.79.
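The per-essay coherence score used later in the error analysis (Section 4.5) is obtained by averaging the ratings over the raters and over the three dimensions above. A minimal sketch of this averaging, with hypothetical rating values (the rater count and numbers below are illustrative, not taken from the corpus):

```python
from statistics import mean

def essay_coherence_score(ratings):
    """Average Likert ratings (1-10) over all raters and the three
    coherence dimensions (structure, comprehensibility, segmentation)
    to obtain a single coherence score per essay."""
    # `ratings`: one dict per rater, mapping dimension -> rating
    return mean(mean(r.values()) for r in ratings)

# hypothetical ratings from three raters for one essay
ratings = [
    {"structure": 5, "comprehensibility": 7, "segmentation": 6},
    {"structure": 4, "comprehensibility": 6, "segmentation": 6},
    {"structure": 6, "comprehensibility": 7, "segmentation": 5},
]
print(essay_coherence_score(ratings))  # ≈ 5.78
```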
4. Methodology

In this study, we focus on NLP data perturbation [20, 21] and custom probing tasks [31] to evaluate the ability of Italian BERT models to discern features of coherence given different pre-training conditions and fine-tuning. In our analysis, we aim to evaluate automatic coherence modelling techniques, applying them to student essays with varying degrees of well-formedness and coherence. We conducted a number of experiments probing whether state-of-the-art coherence modelling techniques based on BERT encodings would be able to distinguish between original, i.e. allegedly coherent, texts and those containing features of incoherence previously identified for student writing. In our case study, we use data perturbation techniques to reproduce specific student errors observed during the textual analysis of the ITACA project [28] (see Section 3), in order to apply text modifications in a fully controlled fashion. We use representations obtained from BERT [32] models to demonstrate the ability of automatic systems to encode patterns of (in)coherence in a specialized scenario such as Italian student essays and to evaluate their potential for educational purposes.

4.1. Custom Probing Tasks

Using data perturbation techniques, we aim to reproduce both general-purpose coherence modelling perturbation strategies and modifications inspired by some of the most salient features of textual (in)coherence observed in the annotation process of the ITACA project. These include incoherent order of arguments and sentences, incorrect use of connectives, overuse of polyfunctional connectives, unresolved co-reference, the use of the splice comma, and an overuse of paratactical constructions. Assuming that students would not produce these features throughout the whole essay, but only struggle occasionally (e.g. not all connectives are semantically incorrect), we reduced the perturbation ratio to 50% in Pronoun Perturbation, Splice Comma Perturbation and Parataxis Perturbation, in order to create realistic conditions and increase the difficulty of the single tasks. Although data perturbation can also operate on the character level, we opted for token- and sentence-level approaches, maintaining parameters in a controlled setting.

We implemented the following custom probing tasks:

Sentence Order Perturbation [SHUFF]: As in other synthetic datasets for coherence modelling [15], this data perturbation technique randomly shuffles the sentences within a text.

Connective Perturbation [LICO]: In order to imitate texts in which the logical connection between phrases is erroneous, we randomly substituted connectives used in the text, exploiting both manual and automatic processing with Stanza6. To identify the connectives to substitute, we used string matching against all connectives listed in the Lexicon of Italian Connectives (LICO) [33].

Polyfunctional Connective Perturbation [POLYFUNCT]: Based on the ITACA corpus annotation scheme, we implement a probing task imitating young writers' tendency to use simple polyfunctional connectives instead of highly semantically loaded ones. For this, we substitute all connectives in the text with the polyfunctional connective "e".

Pronoun Perturbation [PRON]: For a very simplistic approximation of corrupted anaphoric references, we identified pronouns with Stanza and replaced them randomly with other pronouns isolated from the corpus. To ensure a minimum of correct pronouns, only 50% of the pronouns in the text were corrupted.

Splice Comma Perturbation [SPLICE]: A splice comma is the use of a comma to join two independent sentences; the comma can substitute a full stop, a colon, or a semicolon [34, 35, 36, 37]. In our case, long pause markers such as periods, colons, or semicolons were substituted with a comma. We apply the perturbation to just 50% of the eligible punctuation marks to partially keep punctuation unaltered.

Parataxis Perturbation [PARATAX]: Coordinating conjunctions extracted with Stanza are substituted with punctuation taken from a list to create paratactic sentences. We apply the perturbation to just 50% of the conjunctions in the text to keep some conjunctions untouched.

Text perturbation examples can be consulted in Table 1.

6 https://stanfordnlp.github.io/stanza/

Table 1
Example sentences under text perturbations. The example corresponds to the English "This morning I went to the market. I bought some apples and oranges. Then I went back home and baked a cake."

None: Stamattina io sono andato al mercato. Ho comprato delle mele e delle arance. Poi sono tornato a casa e ho preparato una torta.
Sentence Order Perturbation: Poi sono tornato a casa e ho preparato una torta. Stamattina io sono andato al mercato. Ho comprato delle mele e delle arance.
LICO Connective Perturbation: Stamattina io sono andato al mercato. Ho comprato delle mele e delle arance. Poi sono tornato a casa invece di ho preparato una torta.
Polyfunctional Connective Perturbation: Stamattina io sono andato al mercato. Ho comprato delle mele e delle arance. e sono tornato a casa e ho preparato una torta.
Pronoun Perturbation: Stamattina noi sono andato al mercato. Ho comprato delle mele e delle arance. Poi sono tornato a casa e ho preparato una torta.
Splice Comma Perturbation: Stamattina io sono andato al mercato, Ho comprato delle mele e delle arance, Poi sono tornato a casa e ho preparato una torta.
Parataxis Perturbation: Stamattina io sono andato al mercato. Ho comprato delle mele, delle arance. Poi sono tornato a casa. ho preparato una torta.
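As an illustration, two of the perturbations above, sentence-order shuffling [SHUFF] and splice comma insertion [SPLICE], can be sketched as follows. This is a simplified sketch that assumes sentences are already split and treats any period, colon or semicolon as an eligible pause marker; the actual implementation relies on Stanza for linguistic preprocessing:

```python
import random

def shuffle_sentences(sentences, seed=0):
    """SHUFF: randomly reorder the sentences of a text."""
    rng = random.Random(seed)
    shuffled = sentences[:]
    rng.shuffle(shuffled)
    return shuffled

def splice_commas(text, ratio=0.5, seed=0):
    """SPLICE: replace long pause markers (. : ;) with a comma,
    perturbing only `ratio` of the candidate positions so that
    part of the original punctuation stays unaltered."""
    rng = random.Random(seed)
    chars = list(text)
    candidates = [i for i, c in enumerate(chars) if c in ".:;"]
    for i in rng.sample(candidates, int(len(candidates) * ratio)):
        chars[i] = ","
    return "".join(chars)

sents = ["Stamattina io sono andato al mercato.",
         "Ho comprato delle mele e delle arance.",
         "Poi sono tornato a casa e ho preparato una torta."]
print(shuffle_sentences(sents))
print(splice_commas(" ".join(sents)))
```

With a fixed seed the perturbations are reproducible, which keeps the positive/negative pairs of the binary classification experiments stable across runs.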
4.2. Models

4.2.1. Pre-trained Models

For our experiments, we test three different BERT-based models to obtain vector representations for our probing tasks:

1. BERT-ita base [38]: trained on Italian data from the OPUS corpora collection7 and Wikipedia8. The final training corpus has a size of 13GB and 2,050,057,573 tokens.
2. GilBERTo9: a RoBERTa-based model [39]. The model is trained with the subword masking technique for 100k steps over 71GB of Italian text containing 11,250,012,896 words [40]. The team adopted a vocabulary of 32k BPE subwords generated with the SentencePiece tokenizer [41].

The third model, referred to as ITACA-bert in Appendix A, is the fine-tuned version of BERT-ita described in the following section.

4.2.2. BERT-ita Fine-tuning

Inspired by the works of [42] and [43], the BERT-ita model was fine-tuned using a dataset of high school essays typologically similar to our dataset, kindly provided for this purpose by the Fondazione Bruno Kessler (FBK). The number of essays employed for the fine-tuning corresponds to 2096 dataset entries with a mean text length of 705 tokens. Fine-tuning our BERT model allowed us to provide further contextual and essay-style information to the pre-trained model, increasing the model's ability in domain-specific text representation. The hyperparameter configuration for training is: truncation = max length, padding = max length, batch size = 16, learning rate = 5e-5 and epochs = 2. The model is trained on both the Masked Language Modeling and Next Sentence Prediction tasks [32]. Taking into account the limited amount of data and the relatively quick training time, we use the L4 GPU available in Google Colab10 (Pro version).

4.3. Text Encoding

We retrieved vector representations and performed a binary text classification experiment for each perturbation technique11. The model is fed with batch size = 1 with all the texts contained in the set. To overcome the input length limit of 512 tokens imposed by BERT models and to process the entire text without loss of contextual information, we split a text into two segments when it reached the maximum input length. Furthermore, we adopted a mean-pooling strategy, calculating the mean of the last hidden states of the contextualized token embeddings across the input sequence length. The final text representation is the mean of all segment embeddings in the batch.

7 https://opus.nlpl.eu/
8 https://it.wikipedia.org/wiki/Pagina_principale
9 https://github.com/idb-ita/GilBERTo?tab=readme-ov-file
10 https://colab.research.google.com/
11 The code for this part of the project was written with the help of the AI tool ChatGPT.
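The segment-splitting and mean-pooling strategy described above can be sketched as follows. Random vectors stand in for the encoder output here; in the real pipeline each `(seq_len, hidden)` matrix would be the last hidden state returned by one of the BERT models for a segment of at most 512 tokens:

```python
import numpy as np

MAX_LEN = 512  # input limit (in tokens) of the BERT models

def split_segments(tokens, max_len=MAX_LEN):
    """Split a token sequence into segments of at most max_len tokens."""
    return [tokens[i:i + max_len] for i in range(0, len(tokens), max_len)]

def encode_text(segment_hidden_states):
    """Mean-pool each segment's token embeddings (its last hidden state)
    across the sequence length, then average the segment vectors into a
    single text representation."""
    segment_vecs = [h.mean(axis=0) for h in segment_hidden_states]
    return np.mean(segment_vecs, axis=0)

# stand-in for BERT last hidden states: a 668-token essay split into
# one full 512-token segment and one 156-token remainder, hidden size 768
rng = np.random.default_rng(0)
hidden = [rng.normal(size=(512, 768)), rng.normal(size=(156, 768))]
text_vec = encode_text(hidden)
print(text_vec.shape)  # (768,)
```

Note that averaging per-segment means (rather than pooling all tokens at once) weights both segments equally regardless of their length, which matches the description of the final representation as the mean of the segment embeddings.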
4.4. Model Performance Analysis

We first perform a model performance analysis, comparing classification performance on each of the custom probing tasks with each of the three models. Classification is performed with a Random Forest classifier [44], defining each experiment as a binary classification between original and perturbed texts. The classes were balanced across the entire dataset. To optimize the amount of available data for training and testing, we use 10-fold cross-validation for evaluation. We compare model performance against a majority-class baseline (0.5 for balanced binary classification) and against each other using F1 scores.

4.5. Error Analysis

In a subsequent analysis, we compare the predictions of our best-performing model with the human coherence ratings provided for the corpus. In order to obtain a single coherence score for each essay, the scores were averaged over the different annotators and the three components (structure, comprehensibility and segmentation; see Section 3). We perform an error analysis by comparing the predictions for unmodified texts with the highest and lowest coherence scores, using a random forest classifier trained with the model that achieved the best results in the model comparison. Assuming that all tasks have the same weight, we select the best-performing model according to the average F1 score achieved in the model performance analysis (see Section 4.4). The train set for this evaluation corresponds to 90% of the data, while the test set consists of the 5% of essays with the highest (𝜇 = 8.28, 𝜎 = 0.36) and the 5% with the lowest coherence scores (𝜇 = 2.63, 𝜎 = 0.51). Finally, we interpret the results, manually investigating texts from both tails of the test set that were misclassified as modified texts.

5. Results

The classification experiments show the ability of the BERT models to encode the features of (in)coherence represented by the perturbation techniques introduced in Section 4.1. The following sections illustrate our findings for the BERT model comparison and for the error analysis conducted on a selected subset of non-modified texts.

5.1. Models Comparison Analysis

F1 scores were very similar across models, with just small differences between the three. On average, GilBERTo was found to be the best-performing model for most tasks, probably due to its higher amount of training data and its lighter model architecture. However, we do not expect these differences to be significant. Except for the improvement in the shuffling task after fine-tuning, the ITACA-bert model remains comparable to its base version, probably due to the scarcity of domain-specific training data. Results showed that models achieved better performance on semantic tasks such as polyfunctional conjunction perturbation or pronoun perturbation, while struggling with syntactic probing tasks such as shuffling and splice comma perturbation. For the shuffling task, a considerable improvement can be observed after fine-tuning (+0.12, from F1 = 0.38 to F1 = 0.50). However, neither of the shuffling models performs better than a random baseline, while the splice comma models performed slightly better, with BERT-ita and GilBERTo marginally beating the baseline of 0.5. A graphical comparison of model performances can be seen in Figure 1.

[Figure 1: Model performances comparison on single probing tasks]

A detailed overview of the classification results for single tasks and models can be found in Appendix A. The tables report the F1 score for each experiment and model.

5.2. Error analysis on evaluation set

To better observe the encoding and classification performance of BERT, we decided to isolate the texts with the highest and the lowest average coherence scores, as specified in Section 4.5. The resulting test set corresponds roughly to 10% of the total number of texts in the corpus. Our expectation is that texts with lower coherence scores have a higher chance of being misclassified as modified texts, while texts with higher coherence scores should not lead the classifiers to identify the traits of incoherence targeted by the custom probing tasks. We perform all analyses using the GilBERTo model for text encoding, as it was revealed to be the best-performing model when averaging F1 scores over all tasks of the model performance analysis (see Section 4.4). However, we exclude the shuffling task, as model performance was below the baseline and therefore too low for interpretation. Thus, we train a random forest classifier on the 90% train set for all remaining custom probing tasks described in Section 4.1.

Our results show that the distribution of misclassified labels is generally skewed toward texts with lower coherence scores, but misclassifications for texts with higher coherence scores were also found. While the splice comma and polyfunctional conjunction probing tasks (see Figure 2) showed clearly more misclassifications in the lower tail of the dataset, well-rated texts were also occasionally misclassified as perturbed. On the contrary, the small number of misclassifications in the parataxis and pronoun perturbation probing tasks might suggest that the operationalizations taken in this work are too simplistic to be representative of students' mistakes and therefore not able to pick up on traits of incoherence present in the students' essays. The results of the experiment can be consulted in Appendix A.

[Figure 2: Classification results on evaluation set. The figure shows the amount of misclassified labels for the essays that lie in the highest and lowest tails of the score ranking of the ITACA dataset.]

6. Discussion

Although data perturbation cannot fully reproduce the variability of real-world students' mistakes, our results give valuable insights into the ability of BERT encoders to capture degrees of coherence on both the syntactic and the semantic level. Of course, the efficiency of the data perturbation might be influenced by several factors, such as the fact that the original texts used for our experiments already naturally contain errors of the same or other types. However, we argue that this is the case for any dataset of unknown quality that is subject to automatic coherence evaluation. Indeed, before the evaluation, the texts had not been subjected to any review and, excluding other external factors, they reproduce real-world writing conditions.

The results of language encoding and classification depend on the difficulty of the perturbation task and on the original training of the BERT model. However, despite the fact that BERT-ita base and GilBERTo exploit different training strategies, no drastic performance fluctuations have been observed on our selected language tasks. Even though the effect of fine-tuning with domain-specific data is limited by the amount of available data, it can already be observed in the improvement on the shuffling task.

The classification of the evaluation set highlighted the potential of data perturbation techniques for the encoding of (in)coherence features. Previous approaches to coherence modelling implemented solutions inspired by theoretical intuitions; in our case, we decided to start from natural textual errors and to check the ability of the model to capture the same features present in the text. For a more transparent interpretation of the results and an explanation of individual classifications, it would be of interest to check how attention maps change according to the tuning of the model [45].

7. Conclusion

In this paper, we presented an evaluation of coherence modelling techniques for detecting incoherence in student essays based on surface-level features of incoherence. We used the ITACA corpus of Italian upper secondary school essays to perform a number of classification experiments using data perturbation and BERT-based text encoding methods. After a preliminary comparison between pre-trained and fine-tuned models, we adopted the best-performing one according to our results. The results of the chosen tasks are influenced by the implementation of the perturbation technique, the encoding ability of the model, and the amount and quality of the data the model is pre-trained on. The best performances are bound to the model pre-trained on the highest amount of data (GilBERTo). We based our evaluation on simple F1 measures, considering this sufficiently indicative of the encoding ability of the model applied to each specific probing task.

Since we mainly tested custom perturbation techniques and the encoding abilities of BERT models, future research directions might involve the enhancement of data perturbation techniques, XAI techniques for model behaviour analysis [46, 45], and the exploitation of state-of-the-art generative one-shot and few-shot models in a highly domain-specific scenario such as school essay writing.

Acknowledgments

We thank Fondazione Bruno Kessler Trento for their support on the ITACA corpus and for allowing us to use their student essay dataset for fine-tuning.
References

[1] Ministero dell'Istruzione, dell'Università e della Ricerca, Indicazioni nazionali per i licei, Roma, Italia, 2010.
[2] Ministero dell'Istruzione, dell'Università e della Ricerca, Istituti tecnici: linee guida per il passaggio al nuovo ordinamento, Roma, Italia, 2010.
[3] T. A. Van Dijk, Context and cognition: Knowledge frames and speech act comprehension, Journal of Pragmatics 1 (1977) 211–231.
[4] T. Reinhart, Conditions for text coherence, Poetics Today 1 (1980) 161–180.
[5] F. Danes, Functional sentence perspective and the organization of the text, Papers on Functional Sentence Perspective 23 (1974) 106–128.
[6] P. H. Fries, On the status of theme in English: Arguments from discourse, Micro and Macro Connexity of Texts 45 (1983).
[7] J. R. Hobbs, Coherence and coreference, Cognitive Science 3 (1979) 67–90.
[8] B. J. Grosz, A. K. Joshi, S. Weinstein, Centering: a framework for modelling the coherence of discourse (1994).
[9] B. Di Eugenio, Centering in Italian, arXiv preprint cmp-lg/9608007 (1996).
[10] R. Barzilay, M. Lapata, Modeling local coherence: An entity-based approach, Computational Linguistics 34 (2008) 1–34.
[11] Y. Farag, H. Yannakoudakis, T. Briscoe, Neural automated essay scoring and coherence modeling for adversarially crafted input, arXiv preprint arXiv:1804.06898 (2018).
[12] M. Mesgar, M. Strube, A neural local coherence model for text quality assessment, in: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 2018, pp. 4328–4339.
[13] J. Li, E. Hovy, A model of coherence based on distributed sentence representation, in: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014, pp. 2039–2048.
[14] D. T. Nguyen, S. Joty, A neural local coherence model, in: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2017, pp. 1320–1330.
[15] D. Brunato, D. Colla, F. Dell'Orletta, I. Dini, D. P. Radicioni, A. A. Ravelli, et al., DisCoTex at EVALITA 2023: Overview of the assessing discourse coherence in Italian texts task, in: CEUR Workshop Proceedings, volume 3473, CEUR, 2023, pp. 1–8.
[16] M. Galletti, P. Gravino, G. Prevedello, MPG at DisCoTex: Predicting text coherence by tree-based modelling of linguistic features, in: Proceedings of the Eighth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2023), CEUR.org, 2023.
[17] C. D. Hromei, D. Croce, V. Basile, R. Basili, ExtremITA at EVALITA 2023: Multi-task sustainable scaling to large language models at its extreme (2022).
[18] E. Zanoli, M. Barbini, C. Chesi, et al., IUSSNets at DisCoTex: A fine-tuned approach to coherence, in: Proceedings of the Eighth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2023), CEUR.org, 2023.
[19] D. Brunato, F. Dell'Orletta, I. Dini, A. A. Ravelli, Coherent or not? Stressing a neural language model for discourse coherence in multiple languages, in: Findings of the Association for Computational Linguistics: ACL 2023, 2023, pp. 10690–10700.
[20] M. Moradi, M. Samwald, Evaluating the robustness of neural language models to input perturbations, arXiv preprint arXiv:2108.12237 (2021).
[21] Y. Zhang, L. Pan, S. Tan, M.-Y. Kan, Interpreting the robustness of neural NLP models to textual perturbations, arXiv preprint arXiv:2110.07159 (2021).
[22] J. Wei, K. Zou, EDA: Easy data augmentation techniques for boosting performance on text classification tasks, arXiv preprint arXiv:1901.11196 (2019).
[23] A. Karimi, L. Rossi, A. Prati, AEDA: An easier data augmentation technique for text classification, arXiv preprint arXiv:2108.13230 (2021).
[24] H. Q. Abonizio, E. C. Paraiso, S. Barbon, Toward text data augmentation for sentiment analysis, IEEE Transactions on Artificial Intelligence 3 (2021) 657–668.
[25] G. Rizos, K. Hemker, B. Schuller, Augment to prevent: Short-text data augmentation in deep learning for hate-speech classification, in: Proceedings of the 28th ACM International Conference on Information and Knowledge Management, 2019, pp. 991–1000.
[26] T. Kober, J. Weeds, L. Bertolini, D. Weir, Data augmentation for hypernymy detection, arXiv preprint arXiv:2005.01854 (2020).
[27] T. Nugent, N. Stelea, J. L. Leidner, Detecting environmental, social and governance (ESG) topics using domain-specific language models and data augmentation, in: Flexible Query Answering Systems: 14th International Conference, FQAS 2021, Bratislava, Slovakia, September 19–24, 2021, Proceedings 14, Springer, 2021, pp. 157–169.
[28] A. Bienati, C. Vettori, L. Zanasi, In viaggio verso Itaca: la coerenza testuale come meta della scrittura scolastica. Proposta di una griglia di valutazione, Italiano a scuola 4 (2022) 55–70.
[29] J.-C. Klie, M. Bugert, B. Boullosa, R. E. De Castilho, I. Gurevych, The INCEpTION platform: Machine-assisted and knowledge-oriented interactive annotation, in: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, 2018, pp. 5–9.
[30] A. Ferrari, Linguistica del testo. Principi, fenomeni, strutture, volume 151, Carocci, 2014.
[31] A. Conneau, G. Kruszewski, G. Lample, L. Barrault, M. Baroni, What you can cram into a single vector: Probing sentence embeddings for linguistic properties, arXiv preprint arXiv:1805.01070 (2018).
[32] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, arXiv preprint arXiv:1810.04805 (2018).
[33] A. Feltracco, E. Jezek, B. Magnini, M. Stede, LICO: A lexicon of Italian connectives, CLiC-it (2016) 141.
[34] C. E. Roggia, Una varietà dell'italiano tra scritto e parlato: la scrittura degli apprendenti, Ferrari A., De Cesare A. M. (2010) 197–224.
[35] L. Cignetti, Didattica della scrittura e linguistica del testo: tre priorità di intervento, in: Ostinelli M. (a cura di), La didattica dell'italiano. Problemi e prospettive, DFA SUPSI, Locarno (2015) 14–24.
[36] A. Colombo, A me mi. Dubbi, errori, correzioni nell'italiano scritto, FrancoAngeli, 2010.
[37] M. Prada, Scritto e parlato, il parlato nello scritto. Per una didattica della consapevolezza diamesica, Italiano LinguaDue 8 (2016) 232–260.
[38] S. Schweter, Italian BERT and ELECTRA models, 2020. URL: https://doi.org/10.5281/zenodo.4263142. doi:10.5281/zenodo.4263142.
[39] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov, RoBERTa: A robustly optimized BERT pretraining approach, arXiv preprint arXiv:1907.11692 (2019).
[40] J. Abadji, P. O. Suarez, L. Romary, B. Sagot, Towards a cleaner document-oriented multilingual crawled corpus, arXiv preprint arXiv:2201.06642 (2022).
[41] T. Kudo, J. Richardson, SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing, arXiv preprint arXiv:1808.06226 (2018).
[42] D. Licari, G. Comandè, Italian-Legal-BERT: A pre-trained transformer language model for Italian law, EKAW (Companion) 3256 (2022).
[43] I. Beltagy, K. Lo, A. Cohan, SciBERT: A pretrained language model for scientific text, arXiv preprint arXiv:1903.10676 (2019).
[44] L. Breiman, Random forests, Machine Learning 45 (2001) 5–32.
[45] K. Clark, U. Khandelwal, O. Levy, C. D. Manning, What does BERT look at? An analysis of BERT's attention, arXiv preprint arXiv:1906.04341 (2019).
[46] M. Danilevsky, K. Qian, R. Aharonov, Y. Katsis, B. Kawas, P. Sen, A survey of the state of explainable AI for natural language processing, arXiv preprint arXiv:2010.00711 (2020).

A. Appendix A

Table 2
Model comparison on F1 score for each task. Each probe is run as a binary classification task on 636 dataset entries.

Technique    GilBERTo   ITACA-bert   BERT-base-italian
SHUFF        0.43       0.50         0.38
LICO         0.97       0.96         0.95
POLYFUNCT    0.88       0.88         0.89
PRON         1.00       0.99         0.99
SPLICE       0.56       0.49         0.55
PARATAX      0.99       0.95         0.97
The baseline is set on 0.5 Aug Techniques Train Dataset Len Num Labels Baseline Accuracy LICO 575 2 0.5 0.96 POLYFUNCT 575 2 0.5 0.78 PRON 575 2 0.5 0.98 SPLICE 575 2 0.5 0.7 PARATAX 575 2 0.5 0.98 Table 3 Error analysis B. Appendix B “In base all’esperienza maturata durante la pandemia di Covid-19, il Ministro dell’Istruzione ha proposto di estendere permanentemente, a partire dal prossimo anno scolastico, la Didattica Digitale Integrata (DDI, modalità didattica che combina momenti di insegnamento a distanza e attività svolte in classe) al triennio delle scuole superiori [...]. Immagina di dover scrivere una lettera al Ministro in cui esponi le tue ragioni a favore o contro questa possibilità, argomentandole in modo da convincerlo della bontà delle tue idee [...]. Durante lo svolgimento del testo ricordati di: 1. Chiarire la tesi che intendi difendere. 2. Spiegare le motivazioni a sostegno della tesi. 3. Prendere in considerazione il punto di vista alternativo e illustrare le ragioni per cui non sei d’accordo. 4. Arrivare a una conclusione. 5. Prima di consegnare, ricordati di rileggere con cura il testo che hai scritto. Il tuo obiettivo è convincere il Ministro della bontà della tesi che sostieni. Hai 100 minuti di tempo per scrivere un testo di almeno 600 parole.”