Quale testo è scritto meglio?
A Study on Italian Native Speakers’ Perception of Writing Quality

Aldo Cerulli•, Dominique Brunato⋄, Felice Dell’Orletta⋄

• University of Pisa
a.cerulli1@studenti.unipi.it
⋄ Istituto di Linguistica Computazionale “Antonio Zampolli” (ILC–CNR)
ItaliaNLP Lab - www.italianlp.it
{dominique.brunato, felice.dellorletta}@ilc.cnr.it


                       Abstract                            of existing corpora is that they are cross-sectional
                                                           rather than longitudinal. A notable exception in
    This paper presents a pilot study focused on
                                                           the context of Italian as L1 – which is the focus
    Italian native speakers’ perception of writing
    quality. A group of native speakers expressed          of our contribution – is represented by CItA (Cor-
    their preferences on 100 pairs of essays ex-           pus Italiano di Apprendenti L1), which was jointly
    tracted from an Italian corpus of compositions         developed by the Institute for Computational Lin-
    written by L1 students of lower secondary              guistics of the Italian National Research Council
    school. Analysing their answers, it was pos-           (CNR) of Pisa and the Department of Social and
    sible to identify a set of linguistic features char-   Developmental Psychology at Sapienza University
    acterizing essays perceived as well written and
                                                           of Rome (Barbagli et al., 2016): it is the first dig-
    to assess the impact of students’ errors on the
    perception of text quality. The paper describes        italized collection of essays written by the same
    the crowdsourcing technique to collect data as         group of Italian L1 learners in the first two years of
    well as the linguistic analysis and results.           the lower secondary school1 .
                                                              The diachronic and longitudinal nature of CItA
1   Introduction                                           makes it particularly suitable to study the evolution
The institution of distance learning paradigms,            of L1 writing competence over the two years, as-
which has become crucial during the Covid-19 pan-          suming that many remarkable changes in writing
demic, showed the need to provide schools and uni-         skills occur in this period. For instance, in their
versities with Natural Language Processing (NLP)-          recent work, Miaschi et al. (2021) showed that it is
based tools to assist students, teachers and profes-       possible to automatically learn the writing develop-
sors. Nowadays, language technologies are more             ment curve of students: they extracted a wide set of
and more exploited to develop educational applica-         linguistic features from the essays and used them
tions, such as Intelligent Computer-Assisted Lan-          to train a binary classification algorithm able to
guage Learning (ICALL) systems (Granger, 2003)             predict the chronological order of two productions
and tools for automated essay scoring (Attali and          written by the same pupil at different times.
Burstein, 2006) or automatic error detection and              The present study ranks among research based
correction (Ng et al., 2013). A fundamental re-            on CItA, but chooses a different approach from the
quirement for developing this kind of applications         one just mentioned: instead of tracking the develop-
is the availability of electronically accessible cor-      ment of students’ writing competence, we focused
pora of learners’ productions. Corpora created so          on the perception of writing quality by Italian L1
far differ in many respects. For instance, consider-       speakers with the aim of understanding whether it
ing the types of examined learners, they can gather        is possible to find the linguistic features that are cru-
productions written by L2 students or by native            cially involved in the distinction between ‘better’
speakers: the former have been built for many lan-         and ‘worse’ essays according to our target reader.
guages (e.g. English, Arabic, German, Hungarian,
Basque, Czech, Italian), while the latter are mainly       Contributions To the best of our knowledge, this
available for English. In both cases, a peculiarity        is the first paper that (i) introduces a dataset of
                                                              1
     Copyright © 2021 for this paper by its authors. Use        The corpus is freely available for research goals at
permitted under Creative Commons License Attribution 4.0   http://www.italianlp.it/resources/cita-corpus-italiano-di-
International (CC BY 4.0).                                 apprendenti-l1/
evaluated essays in terms of perceived writing quality
by means of a crowdsourcing task, (ii) deals
with the correlation between linguistic features and
perceived quality of writing and (iii) assesses the
impact of students’ errors on quality perception.

                                 Number of pairs
Survey   Selection criteria      I year   II year
   1     Common prompts             5        5
   2     Narrative                 10        0
   3     Narrative                  0       10
   4     Reflexive                 10        0
   5     Reflexive                  0       10
   6     Descriptive                8        2
   7     Expository                 3        7
   8     Argumentative              3        7
   9     Error bins                10        0
  10     Error bins                 0       10

Table 1: Criteria used for pairing the essays and number
of pairs for each survey.

2     Corpus Collection

As previously mentioned, the starting point of our
study was the CItA corpus. It comprises 1,352
essays, written by 156 pupils of seven lower secondary
schools in Rome (three in the historical
center and four in the suburbs) during the school
years 2012-2013 and 2013-2014. The productions
respond to 124 writing prompts that pertain to five      chosen essays pertaining to the same textual typol-
textual typologies: reflexive, narrative, descriptive,   ogy – assuming that their similarity with regard to
expository and argumentative. An additional ‘com-        the content could let the annotator focus on stylistic
mon prompt’ was presented at the end of each             issues to orient their judgment – and paired them
school year, in which students were asked to write       according to the school year in which they were
a letter to advise a younger friend how to compose       written. Instead, essays in questionnaires 9 and 10
better essays. The common prompts were aimed at          were paired according to their number of errors:
understanding how learners internalize the different     for each year, we divided the range between the
writing instructions given by teachers.                  minimum amount of errors (0) and the maximum
   Each essay contained in CItA is also provided with    one (49 for the first year, 43 for the second one)
a set of metadata tracking students’ biographical,       into ten error bins and designed the two surveys
sociocultural and sociolinguistic information. Be-       choosing a couple of productions for each bin. Sur-
yond the longitudinal nature, the most significant       veys comparing essays with a similar amount of
novelty introduced by CItA regards error annota-         errors were meant to understand which categories
tion, which was manually performed by a mid-             of errors have a greater impact on human judgment.
dle school teacher according to a new three-level
schema including: the macro-class of error (i.e.         2.2      Human Evaluation
grammatical, orthographic and lexical); the class of     After designing the surveys, we moved on to their
error (e.g. verbs, prepositions, monosyllables); and     implementation using the QuestBase platform2 .
the corresponding type of modification required to       We defined a three-section structure including the
correct it. More details about the CItA collection       filling-in instructions, the personal data entry form
are reported in Barbagli et al. (2016).                  and the essays evaluation pages.
                                                           Filling-in instructions. The first section reported
2.1    Essay Selection
                                                         the following submission guidelines:
For the purpose of our investigation, we selected
                                                           Ciao!
200 essays from CItA to be submitted to human              Il presente sondaggio è rivolto a partecipanti di
evaluation. The essays ranged from a minimum               madrelingua italiana. La sua compilazione richiede
of 141 tokens to a maximum of 1153 tokens and              circa 20 minuti. Pima di proseguire, dando il consenso
                                                           alla partecipazione, ti spieghiamo in cosa consiste.
their average length was 359.4 tokens. Then, to            Nelle pagine che seguono leggerai dieci coppie di temi
gather judgments on writing quality, we created ten        scritti da studenti del primo e del secondo anno di scuola
questionnaires, each one consisting of ten pairs of        media. I testi possono contenere un certo numero di er-
                                                           rori. Per ciascuna coppia ti chiediamo di indicare quale
essays of the same grade, and distributed them to          dei due temi ritieni sia scritto meglio.
native speakers of all ages and cultural backgrounds.      Non esistono risposte giuste o sbagliate: conta semplice-
                                                           mente quello che pensi! Tieni presente che i temi di
   Table 1 reports the criteria we adopted to select       una stessa coppia possono trattare argomenti diversi, ma
the pairs of essays. As can be seen, Survey 1 allows       questo non deve influire sul tuo giudizio.
the comparison between essays responding to                La tua partecipazione al sondaggio è completamente
                                                           libera. Se in qualsiasi momento dovessi cambiare idea
the common prompts written by students attending
                                                            2
the first or the second grades. In surveys 2-8, we              https://story.questbase.com/
                   Figure 1: Comparison of a pair of essays extracted from one of the ten surveys.

  e volessi interrompere il test, potrai farlo liberamente.         Essays evaluation. The third section comprised
  Un’ultima cosa: prima di iniziare il sondaggio, ti chiedi-     ten pages, each occupied by two side by side essays
  amo di darci alcune tue informazioni anagrafiche, che
  serviranno solo a fini statistici. I dati rimarranno comple-   and a field to give the answer (Figure 1). The user
  tamente anonimi e in nessun modo le risposte verranno          had to choose the label ‘1’ if they had preferred the
  associate alla tua persona.                                    first essay, ‘2’ otherwise.
  Se hai dubbi, curiosità o proposte di miglioramento,
  scrivimi all’indirizzo: a.cerulli1@studenti.unipi.it.             After carrying out a pilot study to test the ade-
  Buona lettura!
                                                                 quacy of the structure as well as the completeness
  For the sake of completeness, we also report an                and clearness of the instructions, we started col-
English translation of the same guidelines:                      lecting evaluations. Using Linktree3 we added the
                                                                 ten questionnaires links to a single web page and
  Hello!                                                         shared its link through WhatsApp, Facebook and
  This survey is addressed to Italian native speakers. Its
  submission requires about 20 minutes. By completing            Instagram: clicking on it, users were redirected to
  it, you give your consent to participation. Before going       the page and could access every survey.
  on, we explain to you what it consists of.
  In the following pages you will read ten pairs of essays
  written by Italian L1 learners during the first two years      3       Analysis of Human Judgments
  of lower secondary school. The essays may contain
  linguistic errors. For each pair, you are asked to choose      We collected 223 annotations distributed quite ho-
  the best written of the two essays.                            mogeneously among the ten surveys, except for the
  No answers are right or wrong: you only have to
  express your opinion! Bear in mind that the essays of          first one, submitted 28 times. It is worth focusing
  a pair can concern different topics, but this must not         on the heterogeneous composition of the readers
  affect your judgment.
  Your participation in the survey is completely voluntary. You
                                                                 sample. Concerning sex, the large majority of an-
  may withdraw from it at any time.                              swers (183 units, equal to 82.1%) were given by
  Before starting the survey, we ask you to provide some         women, against the 38 (17%) by men; just two
  personal information that will be used for statistical
  purposes. Data will remain completely anonymous and            people preferred not to specify their gender.
  will not be connected to you in any way.                          Regarding age, we divided the group into six
  If you have doubts, curiosities or improve-                    bins (Figure 2). The most frequent class (97 units)
  ment proposals, please write me to the address:
  a.cerulli1@studenti.unipi.it.                                  was ‘20-24 years’, followed by ‘25-29 years’ (64
  Have a good read!                                              units). This means that most readers (72.5%)
                                                                 ranged from 20 to 29 years of age. 35 evaluations
  Personal data entry form. The surveys were ob-
                                                                 (15.8%) were made by natives between 30 and 39
viously anonymous. However, as we mentioned
                                                                 years of age. People belonging to the remaining
before, we asked the annotators to enter some personal
                                                                 bins contributed to the task for an overall 11.7%.
information (age, sex, education) for statistical purposes.
                                                                     3 https://linktr.ee/
 Figure 2: Distribution of annotations with respect to readers’ age bins.
 Figure 3: Distribution of annotations with respect to readers’ education.

   Finally, Figure 3 shows the distribution of sub-        der and calculated the Inter-annotator agreement
missions with respect to readers’ education: 91.9%         (IAA) of the first 15 and 20 annotations. We im-
of annotations were given by people holding an             plemented Krippendorff’s alpha (α), a coefficient
academic degree (118 units, equal to 53.2%) or             that expresses IAA in terms of observed (D_o) and
a high school diploma (86 units, equal to 38.7%).          expected (D_e) disagreement (Krippendorff, 2011):
12 annotators (5.4%) had a middle school certificate;
4 (1.8%) held a doctoral degree; the last two
indicated a non-specific ‘Other’.
                                                                    \alpha = 1 - \frac{D_o}{D_e}              (3)
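As an illustration of Eq. (3), the following is a minimal sketch of Krippendorff's alpha for nominal labels, assuming that every annotator of a survey judged all ten pairs. The function name and the data layout (one list of labels per annotator) are hypothetical and are not taken from the original implementation.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(annotations):
    """Krippendorff's alpha for nominal labels and complete data.

    `annotations` holds one list per annotator, with one label per item
    (here: '1' or '2' for each essay pair). Layout is an assumption.
    """
    n_items = len(annotations[0])
    coincidences = Counter()  # (label_a, label_b) -> weighted pair count
    for item in range(n_items):
        values = [ann[item] for ann in annotations]
        m = len(values)
        if m < 2:
            continue  # an item judged once carries no agreement information
        for a, b in permutations(values, 2):
            coincidences[(a, b)] += 1 / (m - 1)

    totals = Counter()  # marginal frequency of each label
    for (a, _b), count in coincidences.items():
        totals[a] += count
    n = sum(totals.values())

    # Observed disagreement D_o and chance-expected disagreement D_e.
    d_o = sum(v for (a, b), v in coincidences.items() if a != b) / n
    d_e = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    ) / (n * (n - 1))
    if d_e == 0:
        return 1.0  # only one label was ever used
    return 1 - d_o / d_e
```

With two annotators labelling four pairs as `[1, 1, 2, 2]` and `[1, 2, 2, 1]`, the sketch gives D_o = 0.5 and D_e = 32/56, hence alpha = 0.125.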
                                                              We noticed that IAA values of the first 15 sub-
3.1   Inter-Annotator Agreement                            missions ordered by their increasing weighted dis-
                                                           tance were the highest. Thus, we took them into
At this point, we defined a selection function to
                                                           account (150 total annotations) for the analysis and
discard inaccurate annotations and obtain the same
                                                           discarded the remaining 734 . It is noteworthy that
number of coherent annotations for each survey.
                                                           the selection led us to an average IAA of 0.26, that
Thus, we first built the average vector of every
                                                           is a much higher value than the initial 0.12. Rely-
survey as the set of ten values ‘1’ or ‘2’ chosen
                                                           ing on the selected annotations, we established the
according to the most assigned label to each pair
                                                           ‘winning’ and ‘loser’ essay of each pair.
of essays; then, we calculated the distance between
each survey average vector and all its annotations.        4     Data Analysis
We implemented the Euclidean metric generalized
to the n-dimensional space, which computes the             We carried out two evaluations: the first was
distance between two vectors as the square root of the     meant to identify which linguistic features have the
sum of the squared differences of their components:        greatest impact on the human assessment of writing
                                                           quality; the second focused on the impact of students’
                                                           errors on annotators’ judgments. In what follows
                                                           we describe the approach underlying the two perspectives
                                                           and discuss our most interesting findings.

                  \sqrt{\sum_{k=1}^{n} (p_k - q_k)^2}                      (1)
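The unweighted step of Section 3.1 can be sketched as follows: build the vector of most assigned labels for a survey, then measure how far each submission lies from it with the distance of Eq. (1). Function names and the tie-breaking rule are assumptions made for illustration only.

```python
import math
from collections import Counter


def majority_vector(annotations):
    """Return, for each essay pair, the label (1 or 2) chosen most often.

    `annotations` is a list of submissions, each a list with one label
    per pair. Ties go to the label seen first (an assumption; the paper
    does not state how ties were handled).
    """
    n_pairs = len(annotations[0])
    return [
        Counter(sub[k] for sub in annotations).most_common(1)[0][0]
        for k in range(n_pairs)
    ]


def euclidean_distance(p, q):
    """Eq. (1): square root of the summed squared component differences."""
    return math.sqrt(sum((pk - qk) ** 2 for pk, qk in zip(p, q)))


# Toy survey with three pairs and three submissions.
submissions = [[1, 2, 1], [1, 2, 2], [2, 2, 1]]
reference = majority_vector(submissions)
distances = [euclidean_distance(s, reference) for s in submissions]
```

Since labels are only 1 or 2, each disagreement with the reference vector contributes exactly 1 to the sum under the square root.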
   To emphasize how strongly an answer deviates
                                                           4.1    Linguistic Profiling and Stylistic Analysis
from the average, we assigned every
pair a weight (w_k) equal to the number of times           The first analysis relies on linguistic profiling, an
the ‘winning’ essay was chosen; then, we                   NLP-based methodology in which a large set of
computed the weighted distance between annotations         linguistically-motivated features automatically extracted
and average vectors:                                       from annotated texts is used to obtain a
                                                           vector-based representation of them. Such representations
                                                           can then be compared across texts representative
                                                           of different textual genres and varieties to identify
                                                           the peculiarities of each (Montemagni, 2013;

                 \sqrt{\sum_{k=1}^{n} w_k (p_k - q_k)^2}                  (2)
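A possible reading of the weighted distance in Eq. (2) and of the subsequent selection of the closest submissions is sketched below. The helper names, the `keep` parameter and the toy data are hypothetical; the reference vector is passed in explicitly so that the block stays self-contained.

```python
import math
from collections import Counter


def pair_weights(annotations):
    """w_k: how many times the most chosen essay of pair k was selected."""
    n_pairs = len(annotations[0])
    return [
        Counter(sub[k] for sub in annotations).most_common(1)[0][1]
        for k in range(n_pairs)
    ]


def weighted_distance(p, q, w):
    """Eq. (2): Euclidean distance with one weight per essay pair."""
    return math.sqrt(
        sum(wk * (pk - qk) ** 2 for wk, pk, qk in zip(w, p, q))
    )


def select_closest(annotations, reference, weights, keep):
    """Keep the `keep` submissions closest to the reference vector."""
    ranked = sorted(
        annotations,
        key=lambda sub: weighted_distance(sub, reference, weights),
    )
    return ranked[:keep]


submissions = [[1, 2, 1], [1, 2, 2], [2, 2, 1], [1, 2, 1]]
reference = [1, 2, 1]  # most assigned label for each pair
weights = pair_weights(submissions)
kept = select_closest(submissions, reference, weights, keep=2)
```

Under this weighting, disagreeing with a near-unanimous choice moves a submission further from the reference than disagreeing on a contested pair, so the least typical annotators are ranked last.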
   Finally, we ranked weighted and unweighted                  4
                                                                 The corpus of evaluated essays is available at
distance values of each survey in ascending or-            http://www.italianlp.it/EvaluatedEssays.zip
 Feature                         ‘Winning’           ‘Losers’
                                 Avg.     SD         Avg.     SD
 n tokens                       374.9    127.4      342.7    116.3
 ttr form chunks 100              0.72     0.06       0.70     0.06
 upos dist NOUN                  16.31     2.49      16.98     2.63
 verbs tense dist Fut             2.75     4.37       2.47     6.90
 verbs form dist Ger              3.13     3.52       2.32     3.25
 aux mood dist Sub                4.41     7.22       2.48     4.51
 n prepositional chains          10.70     6.28       9.50     5.98

Table 2: Linguistic features whose average varies significantly between the two subsets.

 Feature                         ‘Winning’           ‘Losers’
                                 Avg.     SD         Avg.     SD
 verbs tense dist Fut             2.75     4.37       2.47     6.90
 dep dist cop                     1.85     0.98       1.93     1.24
 dep dist flat:foreign            0.03     0.14       0.02     0.17
 dep dist flat:name               0.31     0.52       0.32     0.79
 dep dist det:predet              0.27     0.26       0.24     0.30
 dep dist parataxis               0.13     0.21       0.15     0.31
 obj pre                         31.35    13.02      30.02    15.87
 verb edges dist 0                1.23     1.62       1.06     1.74
 verb edges dist 1               13.45     5.44      12.48     6.30
 upos dist CCONJ                  4.17     1.28       4.51     1.61

Table 3: The 10 features that, maximally varying in ‘loser’ essays, are more uniform in the ‘winning’ ones.

van Halteren, 2004). To perform the analysis, we
relied on Profiling-UD5 , a recently introduced tool
that allows the extraction of a wide set of lexical,
                                                                   grades. Secondly, we noticed that a richer vocab-
morpho-syntactic and syntactic features from texts
                                                                   ulary (ttr form chunks 100) plays a crucial role in
linguistically annotated according to the Universal
                                                                   native’s judgment. This is in line with another ad-
Dependencies (UD)6 formalism. These features,
                                                                   vice of the just mentioned ranking, Usa un vocabo-
described in detail in Brunato et al. (2020), have
                                                                   lario ricco ed espressivo (“Use a rich and expres-
been shown to be involved in many tasks, all re-
                                                                   sive vocabulary”), that reflects teachers’ encour-
lated to modeling the form rather than the content
                                                                   agement to use synonyms in order to write clearer
of a text, such as the assessment of text readability
                                                                   and more readable compositions. Values related
and linguistic complexity and the identification of
                                                                   to the third feature (upos dist NOUN) reveal that
stylistic traits of an author or groups of authors.
                                                                   ‘loser’ essays present a slightly higher distribution
                                                                   of nouns. A predominant use of nouns is typical
                                                                   of highly informative texts (e.g. newspaper articles,
                                                                   laws), while genres closer to speech contain
                                                                   more verbs (Montemagni, 2013). Belonging to the
                                                                   second category, a school essay with fewer nouns
                                                                   is probably perceived as more coherent with its
                                                                   genre. Concerning verbal inflection, ‘better’ productions
                                                                   include, on average, 0.28% more future
                                                                   verbs (verbs tense dist Fut), 0.81% more gerund
   We thus split our annotated corpus into two sections:
one comprised all ‘winning’ essays and the
other all ‘loser’ ones. Using Profiling-UD, we extracted
for each text of the two subsets a feature-based
vector representation. For each considered
feature we calculated the average value, the standard
deviation and the coefficient of variation (SD/Avg)
in the two subsets and we assessed whether the variation
between mean values was significant using
the Wilcoxon rank sum test.
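The per-feature statistics described above can be sketched with the standard library alone. The rank sum test below uses the normal approximation without tie or continuity correction, and the coefficient of variation uses the sample standard deviation; both are assumptions of this sketch, since the paper does not specify which variant was used. All function names are hypothetical.

```python
import math
import statistics


def coefficient_of_variation(values):
    """SD / Avg, using the sample standard deviation (mean must be non-zero)."""
    return statistics.stdev(values) / statistics.mean(values)


def average_ranks(values):
    """1-based ranks; tied values share the mean of the ranks they span."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for idx in order[i:j + 1]:
            ranks[idx] = mean_rank
        i = j + 1
    return ranks


def rank_sum_test(x, y):
    """Wilcoxon rank sum test; returns (z statistic, two-sided p-value)."""
    n1, n2 = len(x), len(y)
    ranks = average_ranks(list(x) + list(y))
    w = sum(ranks[:n1])  # rank sum of the first sample
    mean_w = n1 * (n1 + n2 + 1) / 2
    sd_w = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (w - mean_w) / sd_w
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value
```

In use, `x` and `y` would hold the values of one feature (for instance the token counts) in the ‘winning’ and ‘loser’ subsets, and the feature would be retained when the returned p-value falls below 0.05.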
                                                                   verbs (verbs form dist Ger) and 1.93% more sub-
   Table 2 shows the seven linguistic features                     junctive auxiliary verbs (aux mood dist Sub). Ver-
whose variation turned out to be statistically sig-                bal tenses differing from present and moods dif-
nificant (p − value < 0.05), ordered by increas-                   fering from indicative require elevated linguis-
ing p-values. It emerges that ‘winning’ essays                     tic skills, which positively influence annotators’
are on average longer (32.2 tokens more) than                      choices. The last feature significantly varying be-
the ‘losers’ (n tokens), a finding that may suggest                tween the two groups is the number of prepositional
that longer compositions are evaluated as more                     chains (n prepositional chains): ‘winning’ compo-
reasoned, structured and content-rich. Interest-                   sitions have, on average, 1.2 more of them.
ingly, this also reflects the students’ perception
                                                                      A further study was focused on the variability de-
of school writing: Barbagli et al. (2015) showed
                                                                   gree of linguistic features in the two essay groups.
that two of the most frequent suggestions contained
                                                                   For each subset, we ordered the features by their increasing
in essays that respond to ‘common prompts’ are
                                                                   coefficients of variation; then, we calculated
Leggi/scrivi molto (“Read/write a lot”) and Lavora
                                                                   the difference between the two rankings in or-
sodo, fai vedere che ti impegni (“Work hard, show
                                                                   der to identify the features that were maximally uni-
your dedication”). Thus, pupils possibly write
                                                                   formly distributed in ‘better’ essays as compared
more so as to show their dedication and get higher
                                                                   to the ‘worse’ ones (Table 3). It can be noticed
   5
       http://linguistic-profiling.italianlp.it/                   that future verbs (verbs tense dist Fut) are very
   6
       https://universaldependencies.org/                          uniformly distributed among ‘better’ essays. We
                   ‘Winning’ essays     ‘Loser’ essays
Category           Avg.      SD         Avg.      SD
Grammar            3.28     5.516       4.57     6.126
Orthography        3.18     4.517       4.03     4.826

Table 4: Error categories whose average varies significantly between the two subsets.

have previously commented that their frequency
is higher in the ‘winners’; it proves again that natives
interpret the use of complex verbal forms
as an indicator of higher skills. Also parataxis
distribution (dep dist parataxis) is quite uniform
in ‘winning’ essays; however, its average value is
higher in the ‘loser’ ones. It can be deduced that
annotators prefer hypotaxis but this is not surpris-     year; moreover, Errori di ortografia (“Orthography
ing: hypotactic periods are more structured and          errors”) occupies the 6th and the 1st position among
elegant and require refined abilities to be built. The   the most salient terms respectively of the first and
same evidence is given on the morphosyntactic            the second year. The non-significant variations of
level (upos dist CCONJ), since ‘better’ compositions      lexical (p − value = 0.581) and punctuation errors
include 0.34% fewer coordinating conjunctions.            (p − value = 0.617) are probably due to their
Curiously, ‘better’ essays have, on average, 0.01%        scarce amount in the analysed essays.
more foreign terms (dep dist flat:foreign); this may
suggest that annotators appreciate these expres-         5   Conclusions
sions. Finally, it is worth highlighting a higher and
more uniform percentage of verbs with few modifiers        We presented a pilot study towards the identification
in the ‘winning’ essays (verb edges dist 0,                of the linguistic features that characterize essays
verb edges dist 1).                                        perceived as well written. We collected Italian
                                                           natives’ preferences on 100 pairs of essays written by
                                                         L1 students, that we analysed in terms of linguistic
4.2   Students Errors Impact
                                                         profiling and errors distribution. Our results reveal
                                                         an interesting correspondence between annotators’
                                                         judging criteria and writing instructions that L1
                                                         learners receive from teachers. Our findings could be
                                                         interpreted as an indicator of the reliability of our
                                                         data and, more generally, could suggest the effectiveness
                                                         of crowdsourcing methods to quickly build
                                                         large and reliable datasets. Considering the lack
                                                         of Italian corpora of graded essays, such datasets
                                                         could be valuable resources for the development of
                                                         Computer-Assisted Learning Systems.
                                                            The limited size of our dataset certainly reduced
The last analysis was aimed at assessing whether
and to what extent students’ errors impact human
judgments. We counted the pairs of essays
whose ‘winning’ composition had fewer errors,
those in which the ‘loser’ one had fewer errors
and those with an equal number of errors.
Essays with fewer errors won in 56% of cases,
rising to 79% when pairs with the same number
of errors are included. This procedure
gave a first empirical answer to our starting question:
errors substantially affect human assessment.
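The pair counting described in this paragraph amounts to a three-way tally, sketched below. The input layout, a list of (winner errors, loser errors) tuples, and the toy numbers are assumptions for illustration and do not reproduce the study's data.

```python
def error_impact(pairs):
    """Share of pairs by which essay of the pair contains fewer errors.

    `pairs` is a list of (winner_errors, loser_errors) tuples, one per
    evaluated pair of essays.
    """
    total = len(pairs)
    winner_fewer = sum(1 for w, l in pairs if w < l)
    equal = sum(1 for w, l in pairs if w == l)
    loser_fewer = total - winner_fewer - equal
    return {
        "winner_fewer": winner_fewer / total,
        "equal": equal / total,
        "loser_fewer": loser_fewer / total,
    }


# Toy example with four pairs.
shares = error_impact([(1, 3), (2, 2), (5, 1), (0, 4)])
```

Adding the `winner_fewer` and `equal` shares gives the proportion of pairs in which the preferred essay had no more errors than the other one, which is the second figure reported in the text.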
   At this point, we focused on error categories to      the amount of results. Thus, we have to expand it (i)
identify which ones affect more the perception of        by collecting more annotations for the already exist-
writing quality. For each category, we calculated        ing surveys and (ii) by creating and distributing new
the average number of errors and their standard de-      surveys in order to gather judgments on new pairs
viation in both subsets; then, relying on Wilcoxon       of essays. Analysis on the enlarged dataset could
rank sum test, we found out that grammatical and         provide more features that are own of good essays.
orthographic mistakes vary significantly between         Following the model of Miaschi et al. (2021), we
the two groups (Table 4). As expected, ‘loser’           could use the results to train a classifier that, given
essays have, on average, 1.29 more grammatical           a pair of essays, recognizes the best written one.
errors and 0.85 more orthographic errors. It is             The tool would not presume to replace teachers,
worth to add that orthographic mistakes variation        but it could be a valuable teaching aid. Students
(p − value = 0.007) is more significant than the         could use it to get an immediate and preliminary
other (p − value = 0.029). This could mean that          self-assessment on their written productions so as
natives judge deviations in orthography worse than       to better understand their mistakes and hopefully
those in grammar. Once again, our findings are in        avoid repeating them. Such tools can be very useful
line with Barbagli et al. (2015): Usa una corretta       if integrated into educational processes based on
ortografia (“Use correct orthography”) is the 2nd of     distance learning paradigms, which need adequate
the most frequent suggestions given in the second        technological infrastructures to be really efficient.
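The pair tally and the group comparison described in Section 4.2 can be sketched as follows. This is a minimal illustration, not the authors' code: the function names `tally_pairs` and `rank_sum_test` and the toy error counts are hypothetical, and the test shown uses the normal approximation with midranks for ties and no tie correction, which may differ from the exact procedure used in the study.

```python
import math
from statistics import NormalDist


def tally_pairs(pairs):
    """Count pairs by whether the winning essay has fewer, equal or more errors.

    `pairs` is a list of (winner_errors, loser_errors) tuples (hypothetical format).
    """
    fewer = sum(1 for w, l in pairs if w < l)
    equal = sum(1 for w, l in pairs if w == l)
    more = len(pairs) - fewer - equal
    return {"winner_fewer": fewer, "equal": equal, "winner_more": more}


def rank_sum_test(x, y):
    """Two-sided Wilcoxon rank-sum test (normal approximation).

    Tied values receive midranks; the variance is NOT corrected for ties.
    Returns (z, p_value), where z < 0 means x tends to have lower values.
    """
    # Pool both samples, remembering which group each value came from.
    pooled = sorted((v, g) for g, vals in enumerate((x, y)) for v in vals)
    n = len(pooled)
    rank_sum_x = 0.0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        # Positions i..j-1 hold ranks i+1..j; their average is the midrank.
        midrank = (i + 1 + j) / 2
        rank_sum_x += midrank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    n1, n2 = len(x), len(y)
    mean = n1 * (n1 + n2 + 1) / 2
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (rank_sum_x - mean) / sd
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


if __name__ == "__main__":
    # Toy data only: orthographic error counts per essay (not from the paper).
    toy_pairs = [(1, 3), (2, 2), (4, 1), (0, 5)]
    print(tally_pairs(toy_pairs))
    winners = [w for w, _ in toy_pairs]
    losers = [l for _, l in toy_pairs]
    print(rank_sum_test(winners, losers))
```

In practice a library routine such as `scipy.stats.ranksums` or `scipy.stats.mannwhitneyu` would normally be preferred, especially when ties are frequent, as is common with small error counts.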
References
Yigal Attali and Jill Burstein. 2006. Automated Essay Scoring With
  e-rater® V.2. The Journal of Technology, Learning, and Assessment,
  4(3).
Alessia Barbagli, Pietro Lucisano, Felice Dell’Orletta, and Giulia
  Venturi. 2015. Il ruolo delle tecnologie del linguaggio nel
  monitoraggio dell’evoluzione delle abilità di scrittura: primi
  risultati. Italian Journal of Computational Linguistics (IJCoL),
  1(1):99–117.
Alessia Barbagli, Pietro Lucisano, Felice Dell’Orletta, Simonetta
  Montemagni, and Giulia Venturi. 2016. CItA: an L1 Italian Learners
  Corpus to Study the Development of Writing Competence. In
  Proceedings of the Tenth International Conference on Language
  Resources and Evaluation (LREC’16), pages 88–95, Portorož, Slovenia.
  European Language Resources Association (ELRA).
Dominique Brunato, Andrea Cimino, Felice Dell’Orletta, Simonetta
  Montemagni, and Giulia Venturi. 2020. Profiling-UD: a Tool for
  Linguistic Profiling of Texts. In Proceedings of the 12th Conference
  of Language Resources and Evaluation (LREC 2020), pages 7145–7151,
  Marseille, France. European Language Resources Association (ELRA).
Sylviane Granger. 2003. Error-tagged learner corpora and CALL: A
  promising synergy. CALICO Journal, 20(3):465–480.
Hans van Halteren. 2004. Linguistic profiling for author recognition
  and verification. In Proceedings of the Association for
  Computational Linguistics, pages 200–207.
Klaus Krippendorff. 2011. Computing Krippendorff’s Alpha-Reliability.
  Technical report, University of Pennsylvania.
Alessio Miaschi, Dominique Brunato, and Felice Dell’Orletta. 2021. A
  NLP-based stylometric approach for tracking the evolution of L1
  written language competence. Journal of Writing Research (JoWR),
  13(1):71–105.
Simonetta Montemagni. 2013. Tecnologie linguistico-computazionali e
  monitoraggio della lingua italiana. Studi Italiani di Linguistica
  Teorica e Applicata (SILTA), pages 145–172.
Hwee Tou Ng, Siew Mei Wu, Yuanbin Wu, Christian Hadiwinoto, and Joel
  Tetreault. 2013. The CoNLL-2013 Shared Task on Grammatical Error
  Correction. In Proceedings of the Seventeenth Conference on
  Computational Natural Language Learning: Shared Task, pages 1–12,
  Sofia, Bulgaria. Association for Computational Linguistics.