<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Towards an Automatic Evaluation of (In)coherence in Student Essays</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Filippo Pellegrino</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jennifer Carmen Frey</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lorenzo Zanasi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Eurac Research Institute</institution>
          ,
          <addr-line>Viale Druso Drususallee, 1, 39100 Bolzano, Autonome Provinz Bozen - Südtirol</addr-line>
        </aff>
      </contrib-group>
      <abstract>
<p>Coherence modelling is an important task in natural language processing (NLP) with potential impact on other NLP tasks such as Natural Language Understanding or Automated Essay Scoring. Automatic approaches to coherence modelling aim to distinguish coherent from incoherent (often synthetically created) texts or to identify the correct continuation for a given sample of text, as demonstrated for Italian in the DisCoTex task of EVALITA 2023. While early work on coherence modelling focused on exploring definitions of the phenomenon, exploring the performance of neural models has dominated the field in recent years. However, coherence modelling can also offer interesting linguistic insights with pedagogical implications. In this article, we target coherence modelling for the Italian language in a strongly domain-specific scenario, i.e. education. We use a corpus of student essays collected to analyse students' text coherence, in combination with data perturbation techniques, to experiment with the effect of various linguistically informed features of incoherent writing on current coherence modelling strategies used in NLP. Our results show the capability of encoder models to capture features of (in)coherence in a domain-specific scenario, discerning natural from artificially corrupted texts.</p>
      </abstract>
      <kwd-group>
<kwd>Coherence modelling</kwd>
        <kwd>data perturbation</kwd>
        <kwd>transformers</kwd>
        <kwd>education</kwd>
        <kwd>student essays</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Argumentative essay writing is a fundamental objective in education for both vocational schools and high schools in Italy, as indicated in [<xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>]. It requires students to present arguments supported by personal knowledge or external sources in a coherent and convincing manner.
      </p>
      <p>However, writing coherent texts poses both cognitive and linguistic challenges to novice writers, and the textual competences related to it are frequently claimed to be insufficient, putting pressure on the educational system. Automatically discerning incoherent texts or passages could help teachers to better understand students' problems and give targeted instructions, while students would benefit from more frequent and more timely feedback. To date, however, most NLP research in automatic coherence modelling has focused on semantic similarity between two parts of a text, mostly using well-formed newspaper or Wikipedia texts and thus offering little information for educational contexts.</p>
      <p>In this study, we explore coherence from an educational perspective, utilizing recent language models and data perturbation techniques to probe their value for a linguistically informed and informative automatic coherence evaluation of student essays. While large language models have been used successfully in domain-general coherence modelling before, we test their effectiveness for text analysis in this domain-specific scenario, taking into account both surface and non-standard language features. We discuss: (i) data perturbation techniques to artificially reproduce real-life incoherence in textual data; (ii) a custom probing task design; and (iii) the automatic evaluation of coherence using different encoding models.</p>
      <p>The results of our experiments show the performance of encoder models in recognizing patterns of (in)coherence in a domain-specific educational context such as upper secondary school student essays. The paper is organized as follows: Section 2 provides an overview of previous approaches to coherence modelling and NLP data perturbation with a focus on Italian NLP. Section 3 introduces the data used for this study, giving information on the research project it originates from as well as on the corpus design and annotation. Section 4 provides a detailed description of our methodology, introducing our custom probing tasks (Section 4.1), the models used (Section 4.2) and the text encoding (Section 4.3), as well as a description of the two analyses performed (Section 4.4 and Section 4.5). Sections 5 and 6 present and discuss our results, and Section 7 concludes the article with final considerations.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <sec id="sec-2-1">
        <title>3.1. ITACA Corpus</title>
        <p>
          2.1. Coherence modelling The ITACA corpus1 is an annotated learner corpus
created within the project ITACA: Coerenza nell’ITAliano
Coherence modeling is an important task in natural lan- Accademico [28]. It consists of a total of 636
argumentaguage processing (NLP) with potential impact on other tive essays from Italian L1 upper secondary school
stuNLP tasks such as Natural Language Understanding dents from the autonomous province of Bolzano/Bozen2
or automated essay scoring. Early work on coherence during the school year 2021/2022. The texts were
colmodelling focused on the definition of the phenomenon lected by asking 12th grade students to type an
argumen[
          <xref ref-type="bibr" rid="ref3 ref4 ref5 ref6 ref7">3, 4, 5, 6, 7</xref>
          ] and provides valuable frameworks such as tative essay following precise indications of writing time,
Centering Theory [
          <xref ref-type="bibr" rid="ref8 ref9">8, 9</xref>
          ] and Entity-Grid approach [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]. text length and topic. The full assignment can be
conFollowing the great development of neural network sys- sulted in the Appendix B. While the assignment asked for
tems in recent years, many works such as [
          <xref ref-type="bibr" rid="ref11 ref12 ref13 ref14">11, 12, 13, 14</xref>
          ] a minimun text length of 600 words, the average number
explored coherence modelling implementing further and of tokens in the essay is with 668, just slightly above the
more sophisticated solutions for the English language. minimum length requirement.
        </p>
        <p>
          Recently, the Italian NLP community has approached The totality of the 636 collected texts constitutes 382,964
the topic from an engineering point of view, using Ital- tokens. All data were collected digitally and
anonyian pre-trained neural models to distinguish coherent mously and underwent subsequent control and cleaning
from (mainly synthetically constructed) non-coherent procedures, partly manually, to ensure their integrity
texts [
          <xref ref-type="bibr" rid="ref15 ref16 ref17 ref18">15, 16, 17, 18</xref>
          ]. Some eforts were also made for and to guarantee the anonymity of the participants.
Esmultilingual scenarios [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ] demonstrating the encoding says were collected, by asking students to type their
escapabilities of multilingual models for coherence features. says into an input field in an online form, additional
metadata was collected by a subsequent online
question2.2. Data perturbation naire asking for basic socio-demographic information,
students’ language background, and reading and writing
In data perturbation, dataset entries are corrupted with habits. The whole corpus was automatically tokenized,
specific computational operations to simulate noise con- lemmatized and annotated for part-of-speech and
syntacdition and test the model performance on real world con- tic dependencies with the support of project collaborators
ditions [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ]. Many studies on data perturbation and data from Fondazione Bruno Kessler, who also supported the
augmentation in NLP focus on model agnostic methods project in the setup of an interface for manual annotation
[
          <xref ref-type="bibr" rid="ref20 ref21 ref22 ref23">20, 21, 22, 23</xref>
          ] using random deletion, random swap, syn- based on Inception[29].
onym replacement, random insertion and punctuation A manual annotation of a subset of 388 texts was
perinsertion techniques for text classification with limited formed by two trained annotators and ofers detailed
amount of data. More sophisticated and task-oriented descriptions of the text’s structure, with a focus on the
data augmentation approaches are proposed for senti- use of various linguistic features (such as punctuation,
ment analysis [
          <xref ref-type="bibr" rid="ref24">24</xref>
          ], hate speech classicfiation [
          <xref ref-type="bibr" rid="ref25">25</xref>
          ], hyper- connectives, agreements, anaphora, contradictions) that
nymy detection [
          <xref ref-type="bibr" rid="ref26">26</xref>
          ] and domain specific classification enhance or limit the text’s cohesion and coherence.
[
          <xref ref-type="bibr" rid="ref27">27</xref>
          ]. The manual annotation of the corpus was guided by the
three sections elaborated in [30] and contained
annotations for traits of incoherence referring to
        </p>
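        <p>As a minimal illustration of two of these model-agnostic operations, the following Python sketch (our own illustration, not code from the cited works) applies random deletion and random swap to a tokenized sentence:</p>
        <preformat>
import random

def random_deletion(tokens, p=0.1):
    """Drop each token with probability p, keeping at least one token."""
    kept = [t for t in tokens if random.random() > p]
    return kept if kept else [random.choice(tokens)]

def random_swap(tokens, n_swaps=1):
    """Exchange two randomly chosen token positions, n_swaps times."""
    tokens = list(tokens)
    for _ in range(n_swaps):
        i, j = random.sample(range(len(tokens)), 2)
        tokens[i], tokens[j] = tokens[j], tokens[i]
    return tokens

sentence = "gli studenti scrivono testi argomentativi a scuola".split()
print(random_deletion(sentence))
print(random_swap(sentence))
        </preformat>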
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Data</title>
      <sec id="sec-3-1">
        <title>The data used in this study originates from a research</title>
        <p>project, conducted in South Tyrol between 2020 and 2024.
The project named ITACA: Coerenza nell’ITAliano
Accademico [28] had the aim to study textual competences
of students in their first language Italian with particular
focus on aspects of text coherence. Within the project
various outcomes have been produced: a corpus of Italian
student essays collected in Italian South Tyrolean upper
secondary schools, a validated rating scale to evaluate
coherence in student essays, and coherence ratings for
texts in the corpus from three independent raters using
the previously developed rating scale. The products are
described in the following section.</p>
      </sec>
      <sec id="sec-3-2">
        <title>1. segmentation (e.g. splice comma, added comma,</title>
        <p>not-signed parenthetical clause)
2. logic-argumentative plan (e.g. issues in the use
of connectives, contradictions)
3. thematic-referential plan (e.g. critical agreement,
critical anaphora, not-expanded comment)</p>
      </sec>
      <sec id="sec-3-3">
        <title>The corpus is accessible through an ANNIS search inter</title>
        <p>face 3and can be downloaded in various formats from the
Eurac Research Clarin Center (ERCC) under the CLARIN
ACADEMIC END-USER LICENCE ACA-BY-NC-NORED
1https://www.porta.eurac.edu/lci/itaca/
2texts are collected in Bolzano, Bressanone, Merano and Brunico
3https://commul.eurac.edu/annis/itaca
1.0 licence 4. Downloads and further documentation can features throughout the whole essay, but only struggle
also be accessed via Eurac Research’s PORTA platform5. occasionally (e.g. not all connectives are semantically
incorrect), we reduced the perturbation ratio to 50%
3.2. Manual coherence ratings in Pronoun Perturbation, Splice Comma Perturbation
and Parataxis Perturbation in order to create realistic
conditions and increase the dificulty of the single tasks.</p>
        <p>Although data perturbation can also operate on the
character level, we opted for token- and sentence-level
approaches maintaining parameters in a controlled
setting.</p>
        <p>Each single essay was additionally manually evaluated
in a double-blind manner by a panel of six experts who
applied a specially created, rating scale, which was
subsequently validated to assess textual coherence. The items
were rated on a Likert scale from one to ten and referred
to three dimensions of coherence (structure,
comprehensibility, segmentation). The average structure score  is We implemented the following custom
attested at 4.55 with standard deviation  = 5. For compre- tasks:
hensibility,  = 6.29 and  = 1.65, while for segmentation
 = 5.99 and  = 1.79.
probing</p>
        <sec id="sec-3-3-1">
          <title>4.1. Custom Probing Tasks</title>
          <p>Using data perturbation techniques, we aim to reproduce
both general-purpose coherence modelling perturbation
strategies and modifications inspired by some of the
most salient features of textual (in)coherence observed Splice Comma Perturbation [SPLICE]:
in the annotation process for the ITACA project. These A splice comma is the use of a comma to join two
include incoherent order of arguments and sentences, independent sentences. The comma can substitute
incorrect use of connectives, overuse of polyfunctional a dot, a colon, or semicolon [34, 35, 36, 37]. In our
connectives, unresolved co-reference, the use of splice case, long pause markers such as periods, colons, or
comma and an overuse of paratactical constructions. semicolons were substituted with a comma. We apply
Assuming that students would not produce the these the perturbation to just 50% of the conjunctions in the
text to partially keep punctuation unaltered.</p>
          <p>Pronoun Perturbation [PRON]:
For a very simplistic approximation of corrupted
anaphoric references, we identified pronouns with
Stanza and replaced them randomly by other pronouns
isoleted from the corpus. To ensure a minimum of
correct pronouns, only 50% of the pronouns in the text
were corrupted.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Methodology</title>
      <sec id="sec-4-1">
        <title>In this study, we focus on NLP data perturbation [20, 21]</title>
        <p>and custom probing tasks [31] to evaluate the ability of Connective Perturbation [LICO]:
Italian BERT models of discerning features of coherence In order to imitate texts in which the logical connection
given diferent pre-training conditions and fine tuning. between phrases is erroneous, we randomly substituted
In our analysis, we aim to evaluate automatic coherence connectives used in the text exploiting both manual
modelling techniques, applying them to student essays and automatic processing with Stanza6; To identify
with varying degrees of well-formedness and coherence. the connectives to substitute, we referred to a string
We conducted a number of experiments probing whether matching of all connectives listed in the Lexicon of
state-of-the-art coherence modelling techniques based Italian Connectives (LICO) [33].
on BERT encodings would be able to distinguish between
original, i.e. allegedly coherent texts and those contain- Polyfunctional Connective Perturbation
[POLYing features of incoherence identified for student writing FUNCT]:
before. In our case study, we use data perturbation tech- Based on the ITACA corpus annotation scheme, we
niques to reproduce specific students’ errors observed implement a probing task, imitating young writers
during the textual analysis of the ITACA project [28] (see tendency to use simple polifunctional connectives
Section 3), in order to apply text modification in a fully instead of highly semantically loaded ones. For this, we
controlled fashion. We used representations obtained substitute all connectives in the text by the
polyfuncfrom BERT [32] models to demonstrate the ability of au- tional connective "e".
tomatic systems to encode patterns of (in)coherence in
a specialized scenario such as Italian student essays and
evaluate their potential for educational purposes.</p>
        <p>
          Sentence Order Perturbation [SHUFF]:
As in other synthetic datasets for coherence modelling
[
          <xref ref-type="bibr" rid="ref15">15</xref>
          ] this data perturbation technique is to randomly
shufle sentences within the texts.
4http://hdl.handle.net/20.500.12124/76
5https://www.porta.eurac.edu/itaca
        </p>
      </sec>
      <sec id="sec-4-2">
        <title>6https://stanfordnlp.github.io/stanza/</title>
        <p>Parataxis Perturbation [PARATAX]:
Coordinating conjunctions extracted with Stanza are
substituted with punctuation taken from a list to create
paratactic sentences. We apply the perturbation to
just 50% of the conjunctions in the text to keep some
conjunctions untouched.</p>
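        <p>The following sketch illustrates, under our own simplifying assumptions (regex-based sentence and punctuation matching instead of the Stanza annotations used in the actual experiments), how the sentence-level SHUFF and the token-level SPLICE perturbations can be realized with the 50% ratio described above:</p>
        <preformat>
import random
import re

def shuffle_sentences(text):
    """SHUFF: randomly reorder the sentences of a text."""
    sentences = re.findall(r"[^.!?]+[.!?]", text)
    random.shuffle(sentences)
    return " ".join(s.strip() for s in sentences)

def splice_commas(text, ratio=0.5):
    """SPLICE: turn a given ratio of long-pause markers (. : ;)
    into commas, leaving the rest of the punctuation unaltered."""
    positions = [m.start() for m in re.finditer(r"[.:;]", text)]
    chosen = set(random.sample(positions, int(len(positions) * ratio)))
    return "".join("," if i in chosen else ch for i, ch in enumerate(text))

text = ("Stamattina io sono andato al mercato. Ho comprato delle mele "
        "e delle arance. Poi sono tornato a casa e ho preparato una torta.")
print(shuffle_sentences(text))
print(splice_commas(text))
        </preformat>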
      </sec>
      <sec id="sec-4-3">
        <title>Text perturbation examples can be consulted in Table 1</title>
        <sec id="sec-4-3-1">
          <title>4.2. Models</title>
          <p>4.2.1. Pre-trained Models
For our experiments, we test three diferent BERT-based
models to obtain vector representations for our probing
tasks.
says typologically similar to our dataset, thankfully
provided for this purpose by the Fondazione Bruno Kessler
(FBK). The number of essays employed for the fine-tuning
corresponds to 2096 dataset entries with a mean text
length of 705 tokens. Fine-tuning our BERT model
allowed us to provide further contextual and text essay
style information to the pre-trained model, increasing
the model’s ability in domain-specific text representation.</p>
          <p>The provided hyperparameter configuration for training
is: truncation = max length, padding = max length, batch
size = 16, learning rate = 5e-5 and epochs = 2. The model
is trained on both Masked Language Modeling and Next
Sentence Prediction tasks [32]. Taking into account the
limited amount of data and the relatively quick training
time, we use the L4 GPU available in Google Colab10 (pro
version).
Sentence Order Perturbation
LICO Connective Perturbation
Polyfunctional Connective Perturbation
Pronoun Perturbation
Splice Comma Perturbation
Parataxis Perturbation
Example Sentence
Stamattina io sono andato al mercato. Ho comprato delle mele e delle arance. Poi
sono tornato a casa e ho preparato una torta.</p>
          <p>Poi sono tornato a casa e ho preparato una torta. Stamattina io sono andato al mercato.
Ho comprato delle mele e delle arance.</p>
          <p>Stamattina io sono andato al mercato. Ho comprato delle mele e delle arance. Poi
sono tornato a casa invece di ho preparato una torta.</p>
          <p>Stamattina io sono andato al mercato. Ho comprato delle mele e delle arance. e sono
tornato a casa e ho preparato una torta.</p>
          <p>Stamattina noi sono andato al mercato. Ho comprato delle mele e delle arance. Poi
sono tornato a casa e ho preparato una torta.</p>
          <p>Stamattina io sono andato al mercato, Ho comprato delle mele e delle arance, Poi sono
tornato a casa e ho preparato una torta.</p>
          <p>Stamattina io sono andato al mercato. Ho comprato delle mele, delle arance. Poi sono
tornato a casa. ho preparato una torta.</p>
        </sec>
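          <p>A minimal sketch of such a fine-tuning run with the Hugging Face transformers library, using its legacy NSP dataset utility; the checkpoint id and the essays.txt path are our own assumptions (one sentence per line, essays separated by blank lines), and this is a reconstruction under the stated hyperparameters, not the project's actual training script:</p>
          <preformat>
from transformers import (BertForPreTraining, BertTokenizerFast,
                          DataCollatorForLanguageModeling,
                          TextDatasetForNextSentencePrediction,
                          Trainer, TrainingArguments)

checkpoint = "dbmdz/bert-base-italian-cased"  # assumed BERT-ita checkpoint
tokenizer = BertTokenizerFast.from_pretrained(checkpoint)
model = BertForPreTraining.from_pretrained(checkpoint)  # MLM + NSP heads

# Sentence pairs and NSP labels are built from the raw essay file.
dataset = TextDatasetForNextSentencePrediction(
    tokenizer=tokenizer,
    file_path="essays.txt",  # hypothetical path to the FBK essay data
    block_size=512,          # truncate to BERT's maximum input length
)

# Dynamically masks 15% of the input tokens for the MLM objective.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True)

args = TrainingArguments(output_dir="itaca-bert",
                         num_train_epochs=2,
                         per_device_train_batch_size=16,
                         learning_rate=5e-5)

Trainer(model=model, args=args, data_collator=collator,
        train_dataset=dataset).train()
          </preformat>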
        <sec id="sec-4-3-2">
          <title>4.3. Text Encoding</title>
          <p>Inspired by the works of [42] and [43], the BERT-ita
model was fine-tuned using a dataset of high school
es4.2.2. BERT-ita Fine-tuning
1. BERT-ita base [38]: trained with Italian data from
the OPUS corpora collection7 and Wikipedia8.The We retrieved vector representations and performed a
biifnal training corpus has a size of 13GB and nary text classification experiment for each perturbation
2,050,057,573 tokens. technique11. The model is fed with batch size = 1 with
2. GilBERTo9: RoBERTa based model [39]. The all the texts contained in the set. To overcome the length
model is trained with the subword masking tech- input limit of 512 tokens imposed by BERT models and
nique for 100k steps managing 71GB of Italian process the entire text in a row with no loss of contextual
text with 11,250,012,896 words [40]. The team information, we split the text into two segments when
took up a vocabulary of 32k BPE subwords, gen- reached the max input lenght. Furthermore, we adopted a
erated using SentencePiece tokenizer [41]. mean-pooling strategy by calculating the mean between
the last hidden state of each contextualized token
embedding in the batch across the input sequence length.</p>
          <p>The final text representation is the mean of all segment
embeddings in the batch.
7https://opus.nlpl.eu/
8https://it.wikipedia.org/wiki/Pagina_principale
9https://github.com/idb-ita/GilBERTo?tab=readme-ov-file
10https://colab.research.google.com/
11The code for this part of the project was written with the help of
the AI tool Chat GPT.</p>
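        <p>A sketch of this segment-and-pool encoding, assuming the GilBERTo checkpoint id published on the Hugging Face hub and ignoring, for brevity, the placement of special tokens inside the segments:</p>
        <preformat>
import torch
from transformers import AutoModel, AutoTokenizer

checkpoint = "idb-ita/gilberto-uncased-from-camembert"  # assumed model id
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModel.from_pretrained(checkpoint).eval()

def encode(text, max_len=512):
    """Split an over-long essay into max_len segments, mean-pool the last
    hidden state of each segment, then average the segment vectors."""
    ids = tokenizer(text, truncation=False)["input_ids"]
    segments = [ids[i:i + max_len] for i in range(0, len(ids), max_len)]
    vectors = []
    with torch.no_grad():
        for seg in segments:
            out = model(input_ids=torch.tensor([seg]))
            # mean over the sequence length of the last hidden state
            vectors.append(out.last_hidden_state.mean(dim=1).squeeze(0))
    return torch.stack(vectors).mean(dim=0)  # one vector per essay

vector = encode("Stamattina io sono andato al mercato. Ho comprato delle mele.")
print(vector.shape)  # torch.Size([768])
        </preformat>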
        </sec>
        <sec id="sec-4-3-3">
          <title>4.4. Model Performance Analysis</title>
          <p>We first perform a model performance analysis,
comparing the model performance in classification for each of
the custom probing tasks with each of the three
models. Classification is performed with a Random Forest
classifier [ 44], defining each experiment as a binary
classification between the original and perturbated texts. The
classes were balanced across the entire dataset. To
optimize the amount of available data for training and testing,
we use 10-fold cross-validation for evaluation. We
compare model performance against a majority class baseline
(0.5 for balanced binary classification) and against each
other using f1 scores.</p>
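        <p>A sketch of this evaluation setup with scikit-learn, using random vectors as hypothetical stand-ins for the text representations of Section 4.3:</p>
        <preformat>
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Hypothetical balanced data: one vector per text, 0 = original, 1 = perturbed.
rng = np.random.default_rng(0)
X = rng.normal(size=(1272, 768))       # 636 original + 636 perturbed essays
y = np.array([0] * 636 + [1] * 636)

clf = RandomForestClassifier(random_state=0)
scores = cross_val_score(clf, X, y, cv=10, scoring="f1")
print(f"mean F1 over 10 folds: {scores.mean():.2f}")  # baseline is 0.5
        </preformat>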
        </sec>
        <sec id="sec-4-3-4">
          <title>4.5. Error Analysis</title>
          <p>In a subsequent analysis, we compare the model
predictions of our best-performing model with the human Figure 1: Model performances comparison on single probing
coherence ratings provided for the corpus. In order to tasks
obtain a single coherence score for each essay, the scores
were averaged over the diferent annotators and the three
components (structure, comprehensibility and segmen- not expect these diferences to be significant. Except for
tation; see Section 3). We perform an error analysis by the improvement in the shufling task after fine-tuning,
comparing the predictions for unmodified texts with the the ITACA-bert model remains comparable to its base
highest and lowest coherence scores using a random for- version, probably due to the scarcity of domain-specific
est classifier trained with the model that achieved the training data. Results showed that models achieved
betbest results in the model comparison. Assuming that all ter performance on semantic tasks such as polyfunctional
tasks have the same weight, we select the best perform- conjunction perturbation or pronoun perturbation while
ing model according to the average f1 score achieved in struggling with syntactic probing tasks such as shufling
the model performance analysis (see Section 4.4). The and splice comma perturbation. For the shufling task,
train set for this evaluation corresponds to 90% of the a considerable improvement can be observed after
finedata, while the test set represents the 5% of essays with tuning (+0.12% from F1 = 0.38 to F1 = 0.50). However,
the highest ( = 8.28,  = 0.36) and the 5% with the lowest neither of the shufling models performs better than a
coherence scores ( = 2.63,  = 0.51). Finally, we inter- random baseline, while the splice comma experiment
pret the results, manually investigating texts that were models performed slightly better, with the BERT-ita and
misclassified as modified texts from both tails of the test Gilberto models marginally beating the baseline of 0.5. A
set. graphical comparison between model performances can
be seen in Figure 1.</p>
          <p>A detailed overview of the classification results for single
5. Results tasks and models can be found in the Appendix A. The
tables provide measures of the f1 score for each experiment
and model.</p>
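        <p>A sketch of how such an evaluation split can be derived from the averaged ratings, with hypothetical column names and random values standing in for the actual corpus metadata:</p>
        <preformat>
import numpy as np
import pandas as pd

# Hypothetical frame: per-essay ratings already averaged over annotators.
rng = np.random.default_rng(0)
ratings = pd.DataFrame({
    "structure": rng.uniform(1, 10, 636),
    "comprehensibility": rng.uniform(1, 10, 636),
    "segmentation": rng.uniform(1, 10, 636),
})

# Single coherence score: mean over the three rating dimensions.
coherence = ratings.mean(axis=1).sort_values()

n_tail = int(len(coherence) * 0.05)          # 5% per tail
test_idx = coherence.index[:n_tail].union(coherence.index[-n_tail:])
train_idx = coherence.index.difference(test_idx)
print(len(train_idx), len(test_idx))         # roughly 90% / 10%
        </preformat>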
        </sec>
      </sec>
      <sec id="sec-4-4">
        <title>The classification experiments show the ability of the</title>
        <p>BERT models to encode the features of (in)coherence
represented by the perturbation techniques introduced in
Section 4.1. The following sections illustrate our findings
for the BERT model comparison and the error analysis
conducted on a selected subset of non-modified texts.</p>
        <sec id="sec-4-4-1">
          <title>5.1. Models Comparison Analysis</title>
        </sec>
      </sec>
      <sec id="sec-4-5">
        <title>F1 scores for most models were very similar with just</title>
        <p>small diferences between the three models. In average,
GilBERTo was found to be the best performing model for
most tasks, probably due to its higher amount of training
data and its lighter model architecture. However, we do</p>
        <sec id="sec-4-5-1">
          <title>5.2. Error analysis on evaluation set</title>
        <p>To better observe the encoding and classification performance of BERT, we decided to isolate the texts with the highest and the lowest coherence scores according to the averaged coherence ratings, as specified in Section 4.5. The resulting test set corresponds to roughly 10% of the total number of texts in the corpus. Our expectation is that texts with lower coherence scores have a higher chance of being misclassified as modified texts, while texts with higher coherence scores should not lead the classifiers to identify traits of incoherence as specified in the custom probing tasks. We perform all analyses using the GilBERTo model for text encoding, as it was revealed to be the best performing model when averaging f1 scores over all tasks of the model performance analysis (see Section 4.4). However, we exclude the shuffling task, as model performance was below the baseline and therefore too low for interpretation. Thus, we train a random forest classifier on the 90% train split for all custom probing tasks described in Section 4.1.</p>
        <p>Our results show that the distribution of misclassified labels is generally skewed toward texts with lower coherence scores, but misclassifications of texts with higher coherence scores were also found. While the splice comma and polyfunctional conjunction probing tasks (see Figure 2) showed clearly more misclassifications on the lower tail of the dataset, well-rated texts were also occasionally misclassified as perturbed texts. Conversely, the small number of misclassifications on the parataxis and pronoun perturbation probing tasks might suggest that the operationalizations adopted in this work are too simplistic to be representative of students' mistakes and are therefore not able to pick up on the traits of incoherence present in the students' essays. The results of the experiment can be consulted in Appendix A.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>6. Discussion</title>
      <p>Although data perturbation cannot fully reproduce the variability of real-world student mistakes, our results give valuable insights into the ability of BERT encoders to capture degrees of coherence on both the syntactic and the semantic level. Of course, the efficiency of the data perturbation might be influenced by several factors, such as the fact that the original texts used for our experiments already naturally contain errors of the same or other types. However, we argue that this is the case for any dataset of unknown quality that is subject to automatic coherence evaluation. Indeed, the texts were not subjected to any review before the evaluation and, excluding other external factors, they reproduce real-world writing conditions. The results of language encoding and classification depend on the difficulty of the perturbation task and on the original training of the BERT model. However, despite the fact that BERT-ita base and GilBERTo exploit different training strategies, no drastic performance fluctuations have been observed on our selected language tasks. Even though the effect of fine-tuning with domain-specific data is limited by the amount of available data, it can already be observed in the increment in shuffling task performance.</p>
      <p>The classification of the evaluation set highlighted the potential of data perturbation techniques for the encoding of (in)coherence features. Previous approaches to coherence modelling implemented solutions inspired by theoretical intuitions. In our case, we decided to start from natural textual errors and to check the ability of the model to capture the same features presented in the text. For a more transparent interpretation of the results and an explanation of individual classifications, it would be of interest to check how attention maps change according to the tuning of the model [45].</p>
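      <p>As a pointer for such follow-up work, attention maps can be obtained directly from the encoder; a minimal sketch, assuming the GilBERTo checkpoint id used above:</p>
      <preformat>
import torch
from transformers import AutoModel, AutoTokenizer

checkpoint = "idb-ita/gilberto-uncased-from-camembert"  # assumed model id
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModel.from_pretrained(checkpoint, output_attentions=True).eval()

inputs = tokenizer("Stamattina io sono andato al mercato.", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs)

# One tensor per layer, shaped (batch, heads, seq_len, seq_len); comparing
# these maps before and after fine-tuning would support the analysis in [45].
print(len(out.attentions), out.attentions[-1].shape)
      </preformat>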
    </sec>
    <sec id="sec-6">
      <title>7. Conclusion</title>
      <p>In this paper, we presented an evaluation of coherence modelling techniques for detecting incoherence in student essays based on surface-level features of incoherence. We used the ITACA corpus of Italian upper secondary school essays to perform a number of classification experiments using data perturbation and BERT-based text encoding methods. After a preliminary comparison between pre-trained and fine-tuned models, we adopted the best performing one according to our results. The results of the chosen tasks are influenced by the implementation of the perturbation technique, the encoding ability of the model, and the amount and quality of the data the model is pre-trained on. The best performances are bound to the model pre-trained on the highest amount of data (GilBERTo). We based our evaluation on simple f1 measures, considering this sufficiently indicative of the encoding ability of the model applied to each specific probing task.</p>
      <p>Since we mainly tested custom perturbation techniques and the encoding abilities of BERT models, future research directions might involve the enhancement of data perturbation techniques, XAI techniques for model behaviour analysis [46, 45], and the exploitation of state-of-the-art generative one-shot and few-shot models in a highly domain-specific scenario such as school essay writing.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <sec id="sec-7-1">
        <title>We thank Fondazione Bruno Kessler Trento for their support on the ITACA corpus and for allowing us to use their student essay dataset for fine-tuning.</title>
      <table-wrap id="tabA1">
        <label>Table A1</label>
        <caption><p>F1 scores for each probing task and model, compared against the majority-class baseline (0.5 for balanced binary classification).</p></caption>
        <table>
          <thead>
            <tr><th>Technique</th><th>GilBERTo F1</th><th>ITACA-bert F1</th><th>BERT-base-italian F1</th><th>Baseline</th></tr>
          </thead>
          <tbody>
            <tr><td>SHUFF</td><td>0.43</td><td>0.5</td><td>0.38</td><td>0.5</td></tr>
            <tr><td>LICO</td><td>0.97</td><td>0.96</td><td>0.95</td><td>0.5</td></tr>
            <tr><td>POLYFUNCT</td><td>0.88</td><td>0.88</td><td>0.89</td><td>0.5</td></tr>
            <tr><td>PRON</td><td>1.0</td><td>0.99</td><td>0.99</td><td>0.5</td></tr>
            <tr><td>SPLICE</td><td>0.56</td><td>0.49</td><td>0.55</td><td>0.5</td></tr>
            <tr><td>PARATAX</td><td>0.99</td><td>0.95</td><td>0.97</td><td>0.5</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <table-wrap id="tabA2">
        <label>Table A2</label>
        <caption><p>Accuracy on the evaluation set for each probing task (shuffling excluded; see Section 5.2).</p></caption>
        <table>
          <thead>
            <tr><th>Technique</th><th>Accuracy</th></tr>
          </thead>
          <tbody>
            <tr><td>LICO</td><td>0.96</td></tr>
            <tr><td>POLYFUNCT</td><td>0.78</td></tr>
            <tr><td>PRON</td><td>0.98</td></tr>
            <tr><td>SPLICE</td><td>0.7</td></tr>
            <tr><td>PARATAX</td><td>0.98</td></tr>
          </tbody>
        </table>
      </table-wrap>
    </sec>
    <sec id="sec-10">
      <title>B. Appendix B</title>
      <p>The full writing assignment given to the students (see Section 3.1) was the following:</p>
      <p>“In base all’esperienza maturata durante la pandemia di Covid-19, il Ministro dell’Istruzione ha proposto di estendere permanentemente, a partire dal prossimo anno scolastico, la Didattica Digitale Integrata (DDI, modalità didattica che combina momenti di insegnamento a distanza e attività svolte in classe) al triennio delle scuole superiori [...]. Immagina di dover scrivere una lettera al Ministro in cui esponi le tue ragioni a favore o contro questa possibilità, argomentandole in modo da convincerlo della bontà delle tue idee [...]. Durante lo svolgimento del testo ricordati di: 1. Chiarire la tesi che intendi difendere. 2. Spiegare le motivazioni a sostegno della tesi. 3. Prendere in considerazione il punto di vista alternativo e illustrare le ragioni per cui non sei d’accordo. 4. Arrivare a una conclusione. 5. Prima di consegnare, ricordati di rileggere con cura il testo che hai scritto. Il tuo obiettivo è convincere il Ministro della bontà della tesi che sostieni. Hai 100 minuti di tempo per scrivere un testo di almeno 600 parole.”</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1] Ministero dell'Istruzione, dell'Università e della Ricerca,
          <article-title>Indicazioni nazionali per i licei</article-title>
          , Roma, Italia,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2] Ministero dell'Istruzione, dell'Università e della Ricerca,
          <article-title>Istituti tecnici: linee guida per il passaggio al nuovo ordinamento</article-title>
          , Roma, Italia,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>T. A. Van Dijk</surname>
          </string-name>
          ,
          <article-title>Context and cognition: Knowledge frames and speech act comprehension</article-title>
          ,
          <source>Journal of pragmatics 1</source>
          (
          <year>1977</year>
          )
          <fpage>211</fpage>
          -
          <lpage>231</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>T.</given-names>
            <surname>Reinhart</surname>
          </string-name>
          ,
          <article-title>Conditions for text coherence</article-title>
          ,
          <source>Poetics today 1</source>
          (
          <year>1980</year>
          )
          <fpage>161</fpage>
          -
          <lpage>180</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>F.</given-names>
            <surname>Danes</surname>
          </string-name>
          ,
          <article-title>Functional sentence perspective and the organization of the text</article-title>
          ,
          <source>Papers on functional sentence perspective 23</source>
          (
          <year>1974</year>
          )
          <fpage>106</fpage>
          -
          <lpage>128</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>P. H.</given-names>
            <surname>Fries</surname>
          </string-name>
          ,
          <article-title>On the status of theme in english: Arguments from discourse</article-title>
          ,
          <source>Micro and macro connexity of texts 45</source>
          (
          <year>1983</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>J. R.</given-names>
            <surname>Hobbs</surname>
          </string-name>
          , Coherence and coreference,
          <source>Cognitive science 3</source>
          (
          <year>1979</year>
          )
          <fpage>67</fpage>
          -
          <lpage>90</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>B. J.</given-names>
            <surname>Grosz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. K.</given-names>
            <surname>Joshi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Weinstein</surname>
          </string-name>
          ,
          <article-title>Centering: a framework for modelling the coherence of discourse (</article-title>
          <year>1994</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>B.</given-names>
            <surname>Di Eugenio</surname>
          </string-name>
          , Centering in italian,
          <source>arXiv preprint cmp-lg/9608007</source>
          (
          <year>1996</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>R.</given-names>
            <surname>Barzilay</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lapata</surname>
          </string-name>
          ,
          <article-title>Modeling local coherence: An entity-based approach</article-title>
          ,
          <source>Computational Linguistics</source>
          <volume>34</volume>
          (
          <year>2008</year>
          )
          <fpage>1</fpage>
          -
          <lpage>34</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Farag</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Yannakoudakis</surname>
          </string-name>
          , T. Briscoe,
          <article-title>Neural automated essay scoring and coherence modeling for adversarially crafted input</article-title>
          , arXiv preprint arXiv:1804.06898 (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>M.</given-names>
            <surname>Mesgar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Strube</surname>
          </string-name>
          ,
          <article-title>A neural local coherence model for text quality assessment</article-title>
          ,
          <source>in: Proceedings of the 2018 conference on empirical methods in natural language processing</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>4328</fpage>
          -
          <lpage>4339</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Hovy</surname>
          </string-name>
          ,
          <article-title>A model of coherence based on distributed sentence representation</article-title>
          ,
          <source>in: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)</source>
          ,
          <year>2014</year>
          , pp.
          <fpage>2039</fpage>
          -
          <lpage>2048</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>D. T.</given-names>
            <surname>Nguyen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Joty</surname>
          </string-name>
          ,
          <article-title>A neural local coherence model</article-title>
          ,
          <source>in: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>1320</fpage>
          -
          <lpage>1330</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>D.</given-names>
            <surname>Brunato</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Colla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Dell'Orletta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Dini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. P.</given-names>
            <surname>Radicioni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. A.</given-names>
            <surname>Ravelli</surname>
          </string-name>
          , et al.,
          <source>Discotex at evalita</source>
          <year>2023</year>
          <article-title>: overview of the assessing discourse coherence in italian texts task</article-title>
          ,
          <source>in: CEUR WORKSHOP PROCEEDINGS</source>
          , volume
          <volume>3473</volume>
          ,
          <string-name>
            <surname>CEUR</surname>
          </string-name>
          ,
          <year>2023</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>8</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>M.</given-names>
            <surname>Galletti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Gravino</surname>
          </string-name>
          , G. Prevedello, Mpg at discotex:
          <article-title>Predicting text coherence by treebased modelling of linguistic features, in: Proceedings of the Eighth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian</article-title>
          .
          <source>Final Workshop (EVALITA</source>
          <year>2023</year>
          ), CEUR. org,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>C. D.</given-names>
            <surname>Hromei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Croce</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Basile</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Basili</surname>
          </string-name>
          , Extremita at evalita
          <year>2023</year>
          <article-title>: Multi-task sustainable scaling to large language models at its extreme (</article-title>
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>E.</given-names>
            <surname>Zanoli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Barbini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Chesi</surname>
          </string-name>
          , et al.,
          <article-title>Iussnets at disco-tex: A fine-tuned approach to coherence, in: Proceedings of the Eighth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian</article-title>
          .
          <source>Final Workshop (EVALITA</source>
          <year>2023</year>
          ), CEUR. org,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>D.</given-names>
            <surname>Brunato</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Dell'Orletta</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Dini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. A.</given-names>
            <surname>Ravelli</surname>
          </string-name>
          ,
          <article-title>Coherent or not? stressing a neural language model for discourse coherence in multiple languages</article-title>
          ,
          <source>in: Findings of the Association for Computational Linguistics: ACL</source>
          <year>2023</year>
          ,
          <year>2023</year>
          , pp.
          <fpage>10690</fpage>
          -
          <lpage>10700</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>M.</given-names>
            <surname>Moradi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Samwald</surname>
          </string-name>
          ,
          <article-title>Evaluating the robustness of neural language models to input perturbations</article-title>
          ,
          <source>arXiv preprint arXiv:2108.12237</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , L. Pan,
          <string-name>
            <given-names>S.</given-names>
            <surname>Tan</surname>
          </string-name>
          , M.-
          <string-name>
            <given-names>Y.</given-names>
            <surname>Kan</surname>
          </string-name>
          ,
          <article-title>Interpreting the robustness of neural nlp models to textual perturbations</article-title>
          ,
          <source>arXiv preprint arXiv:2110.07159</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>J.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Zou</surname>
          </string-name>
          , Eda:
          <article-title>Easy data augmentation techniques for boosting performance on text classification tasks</article-title>
          , arXiv preprint arXiv:1901.11196 (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>A.</given-names>
            <surname>Karimi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Rossi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Prati</surname>
          </string-name>
          ,
          <article-title>Aeda: an easier data augmentation technique for text classification</article-title>
          ,
          <source>arXiv preprint arXiv:2108.13230</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>H. Q.</given-names>
            <surname>Abonizio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. C.</given-names>
            <surname>Paraiso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Barbon</surname>
          </string-name>
          ,
          <article-title>Toward text data augmentation for sentiment analysis</article-title>
          ,
          <source>IEEE Transactions on Artificial Intelligence</source>
          <volume>3</volume>
          (
          <year>2021</year>
          )
          <fpage>657</fpage>
          -
          <lpage>668</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>G.</given-names>
            <surname>Rizos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Hemker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Schuller</surname>
          </string-name>
          ,
          <article-title>Augment to prevent: short-text data augmentation in deep learning for hate-speech classification</article-title>
          ,
          <source>in: Proceedings of the 28th ACM international conference on information and knowledge management</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>991</fpage>
          -
          <lpage>1000</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>T.</given-names>
            <surname>Kober</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Weeds</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Bertolini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Weir</surname>
          </string-name>
          ,
          <article-title>Data augmentation for hypernymy detection</article-title>
          , arXiv preprint arXiv:2005.01854 (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>T.</given-names>
            <surname>Nugent</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Stelea</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. L.</given-names>
            <surname>Leidner</surname>
          </string-name>
          ,
          <article-title>Detecting environmental, social and governance (esg) topics using domain-specific language models and data augmentation</article-title>
          ,
          <source>in: Flexible Query Answering Systems: 14th International Conference, FQAS 2021</source>
          , Bratislava, Slovakia,
          <source>September 19-24</source>
          ,
          <year>2021</year>
          , Proceedings 14, Springer, 2021, pp. 157–169.
        </mixed-citation>
      </ref>
      <ref id="ref28"><mixed-citation>[28] A. Bienati, C. Vettori, L. Zanasi, In viaggio verso itaca: la coerenza testuale come meta della scrittura scolastica. Proposta di una griglia di valutazione, Italiano a scuola 4 (2022) 55–70.</mixed-citation></ref>
      <ref id="ref29"><mixed-citation>[29] J.-C. Klie, M. Bugert, B. Boullosa, R. E. De Castilho, I. Gurevych, The INCEpTION platform: Machine-assisted and knowledge-oriented interactive annotation, in: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, 2018, pp. 5–9.</mixed-citation></ref>
      <ref id="ref30"><mixed-citation>[30] A. Ferrari, Linguistica del testo. Principi, fenomeni, strutture, volume 151, Carocci, 2014.</mixed-citation></ref>
      <ref id="ref31"><mixed-citation>[31] A. Conneau, G. Kruszewski, G. Lample, L. Barrault, M. Baroni, What you can cram into a single vector: Probing sentence embeddings for linguistic properties, arXiv preprint arXiv:1805.01070 (2018).</mixed-citation></ref>
      <ref id="ref32"><mixed-citation>[32] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, arXiv preprint arXiv:1810.04805 (2018).</mixed-citation></ref>
      <ref id="ref33"><mixed-citation>[33] A. Feltracco, E. Jezek, B. Magnini, M. Stede, LICO: A lexicon of Italian connectives, CLiC-it (2016) 141.</mixed-citation></ref>
      <ref id="ref34"><mixed-citation>[34] C. E. Roggia, Una varietà dell'italiano tra scritto e parlato: la scrittura degli apprendenti, Ferrari A., De Cesare A. M. (2010) 197–224.</mixed-citation></ref>
      <ref id="ref35"><mixed-citation>[35] L. Cignetti, Didattica della scrittura e linguistica del testo: tre priorità di intervento, in: Ostinelli M. (a cura di), La didattica dell'italiano. Problemi e prospettive, DFA SUPSI, Locarno (2015) 14–24.</mixed-citation></ref>
      <ref id="ref36"><mixed-citation>[36] A. Colombo, A me mi. Dubbi, errori, correzioni nell'italiano scritto, FrancoAngeli, 2010.</mixed-citation></ref>
      <ref id="ref37"><mixed-citation>[37] M. Prada, Scritto e parlato, il parlato nello scritto. Per una didattica della consapevolezza diamesica, Italiano LinguaDue 8 (2016) 232–260.</mixed-citation></ref>
      <ref id="ref38"><mixed-citation>[38] S. Schweter, Italian BERT and ELECTRA models, 2020. URL: https://doi.org/10.5281/zenodo.4263142. doi:10.5281/zenodo.4263142.</mixed-citation></ref>
      <ref id="ref39"><mixed-citation>[39] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov, RoBERTa: A robustly optimized BERT pretraining approach, arXiv preprint arXiv:1907.11692 (2019).</mixed-citation></ref>
      <ref id="ref40"><mixed-citation>[40] J. Abadji, P. O. Suarez, L. Romary, B. Sagot, Towards a cleaner document-oriented multilingual crawled corpus, arXiv preprint arXiv:2201.06642 (2022).</mixed-citation></ref>
      <ref id="ref41"><mixed-citation>[41] T. Kudo, J. Richardson, SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing, arXiv preprint arXiv:1808.06226 (2018).</mixed-citation></ref>
      <ref id="ref42"><mixed-citation>[42] D. Licari, G. Comandè, Italian-Legal-BERT: A pre-trained transformer language model for Italian law, EKAW (Companion) 3256 (2022).</mixed-citation></ref>
      <ref id="ref43"><mixed-citation>[43] I. Beltagy, K. Lo, A. Cohan, SciBERT: A pretrained language model for scientific text, arXiv preprint arXiv:1903.10676 (2019).</mixed-citation></ref>
      <ref id="ref44"><mixed-citation>[44] L. Breiman, Random forests, Machine Learning 45 (2001) 5–32.</mixed-citation></ref>
      <ref id="ref45"><mixed-citation>[45] K. Clark, U. Khandelwal, O. Levy, C. D. Manning, What does BERT look at? An analysis of BERT's attention, arXiv preprint arXiv:1906.04341 (2019).</mixed-citation></ref>
      <ref id="ref46"><mixed-citation>[46] M. Danilevsky, K. Qian, R. Aharonov, Y. Katsis, B. Kawas, P. Sen, A survey of the state of explainable AI for natural language processing, arXiv preprint arXiv:2010.00711 (2020).</mixed-citation></ref>
    </ref-list>
  </back>
</article>