<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Automatic Error Detection: Comparing AI vs. Human Performance on L2 Italian Texts</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Irene Fioravanti</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Luciana Forti</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Stefania Spina</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University for Foreigners of Perugia</institution>
          ,
          <addr-line>Piazza Fortebraccio 4, 06123 Perugia</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
<p>This paper reports on a study aimed at comparing AI vs. human performance in detecting and categorising errors in L2 Italian texts. Four LLMs were considered: ChatGPT, Copilot, Gemini and Llama3. Two groups of human annotators were involved: L1 and L2 speakers of Italian. A gold standard set of annotations was developed. A fine-grained annotation scheme was adopted to reflect the specific traits of Italian morphosyntax, with related potential learner errors. Overall, we found that human annotation outperforms AI, with some degree of variation with respect to specific error types. We interpret this as a possible effect of the over-reliance on English as the main language used in NLP tasks, and thus support a more widespread consideration of different languages.</p>
      </abstract>
      <kwd-group>
<kwd>Error detection</kwd>
        <kwd>error correction</kwd>
        <kwd>artificial intelligence</kwd>
        <kwd>large language models</kwd>
        <kwd>L2 Italian</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Identifying errors in texts written by second language
(L2) learners is a relevant task in several research areas,
which can also have practical applications in a variety of
fields. Error analysis is a traditional approach adopted in
second language acquisition research for decades
        <xref ref-type="bibr" rid="ref5">(Corder 1967)</xref>
        , which learner corpus research has more
recently revisited in light of the availability of learner
corpora and corpus-based methods of analysis
        <xref ref-type="bibr" rid="ref7">(Dagneaux et al. 1998)</xref>
        . In addition, acquisitional
research on learners’ errors has relevant pedagogical
implications involving error-related feedback:
appropriate corrective feedback can lead to improved
writing skills in both L1 and L2 writing
        <xref ref-type="bibr" rid="ref1">(Biber et al.
2011)</xref>
        . Furthermore, automatic error detection and
categorisation is key in language testing and assessment
research and practice, with reference to automated essay
scoring
        <xref ref-type="bibr" rid="ref14">(e.g., Song 2024)</xref>
        , which has important
implications for rubric descriptors.
      </p>
<p>The interest of Natural Language Processing (NLP)
in grammatical error correction (GEC) and grammatical
error detection (GED) is driven by the creation of systems
used in Intelligent Computer-Assisted Language
Learning (ICALL), Automated Essay Scoring (AES) or
Automatic Writing Evaluation (AWE) contexts. ICALL
systems integrate NLP techniques into CALL platforms,
providing learners with flexible and dynamic
interactions in their learning process. AES systems
automatically grade written texts with machine learning
techniques, as do AWE systems, which additionally provide
learners with feedback.</p>
      <p>Identifying and annotating errors in the
performance of L2 learners, while beneficial for both
pedagogical and research purposes, presents
considerable challenges. This process is typically
conducted manually in the case of learner corpora due
to the inherent nature of errors as latent phenomena.
The manual identification of learners’ errors requires a
substantial degree of subjective judgment by human
annotators (Dobrić 2023), as well as a considerable
investment in terms of time.</p>
      <p>The present study aims to contribute to the
evaluation of the performance of Large Language
Models (LLMs) in the task of automatic grammatical
error detection (GED) in written texts produced by L2
learners. In particular:</p>
      <list list-type="order">
        <list-item>
          <p>
            it evaluates the behaviour of different LLMs on an error detection task in written
texts produced by L2 learners of Italian, a
language other than English, in line with
recent studies criticising the over-reliance on
English in NLP research
            <xref ref-type="bibr" rid="ref16">(Søgaard 2022)</xref>
            and
seeking to contribute to the very few studies
that do consider languages other than English
            <xref ref-type="bibr" rid="ref18">(e.g.,
MultiGED-2023; Volodina et al. 2023)</xref>
            ;
          </p>
        </list-item>
        <list-item>
          <p>it targets specific error types and grammatical
categories in order to mitigate the problems
arising from the broadness of the notion of
error, focusing on clear-cut and possibly
unambiguous error categories;</p>
        </list-item>
        <list-item>
          <p>it relies on a high degree of accuracy in error
annotation, which was manually performed by
three researchers on a small learner dataset
serving as the test set on which the systems are
evaluated;</p>
        </list-item>
        <list-item>
          <p>it assesses the performance of LLMs in error
detection and categorisation, through a
comparison with the performance of native
Italian students and advanced learners of L2
Italian on the same task.</p>
        </list-item>
      </list>
    </sec>
    <sec id="sec-2">
      <title>2. Related works</title>
      <p>
        Research on automatic error detection in L2 written
texts, mainly adopting machine learning approaches,
has significantly developed in recent years
<xref ref-type="bibr" rid="ref4">(Bryant et al. 2023)</xref>
        , especially within the framework of shared tasks
focused on GED and GEC. For instance, Di Nuovo et al.
(2019; 2022) implemented a novel Italian treebank which
includes texts written by learners of Italian, proposing an
annotation scheme suitable for L2 production that
encompasses Universal Dependencies (UD) and error annotation.
      </p>
      <p>
        The CoNLL-2014 Shared Task on Grammatical Error
Correction
        <xref ref-type="bibr" rid="ref13">(Ng et al. 2014)</xref>
        was based on the
identification of 28 error types involving major
grammatical categories as well as spelling and
punctuation errors. The test set consisted of 50 essays on
two different topics, written by 25 learners of L2 English,
that were error-annotated by two native speakers. The
BEA Grammatical Error Correction shared task
        <xref ref-type="bibr" rid="ref2">(Bryant
et al. 2019)</xref>
        used a larger dataset (350 essays written by
334 learners and native speakers of English) and a
similar taxonomy consisting of 25 error types. More
recently, the NLP4CALL shared task on Multilingual
Grammatical Error Detection
        <xref ref-type="bibr" rid="ref18">(MultiGED-2023; Volodina
et al. 2023)</xref>
        was the first multilingual shared task
including four languages in addition to English: Czech,
German, Italian and Swedish. The datasets used for the
task varied across languages: the Italian dataset
consisted of 813 written learner texts. Participants
mainly used systems based on pre-trained LLMs.
      </p>
      <p>A recent study by Kruijsbergen et al. (2024) focused
on L1 and L2 Dutch and explored the capabilities of
LLMs in written error detection, with both a fine-tuning
and a zero-shot approach through prompting a
generative language model (GPT-3.5). Results highlight
that the fine-tuning approach largely outperforms
zero-shot prompting, for both L1 and L2 texts.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Method</title>
      <p>To evaluate AI performance in automatic GED on L2
written texts, we designed our study based on the
following stages: selection of the text sample; error type
identification; definition of the gold standard
(henceforth, GS); evaluation of LLMs’ annotations;
comparison between LLMs and human performance.</p>
      <sec id="sec-3-1">
        <title>3.1. Sample texts</title>
        <p>
          We used authentic L2 data derived from a learner corpus
of Italian, the CELI corpus
          <xref ref-type="bibr" rid="ref16 ref17">(Spina et al., 2022; Spina et al.,
2024)</xref>
          . It is a pseudo-longitudinal corpus of L2 Italian,
representative of written Italian produced by
intermediate and advanced learners. The CELI corpus is
made of four subcorpora, one for each proficiency level
(B1; B2; C1; C2) equally designed in terms of tokens.
Eleven texts, totalling 1,335 tokens, were randomly
selected from the B1 subcorpus. We focused
on morphosyntactic errors only. We chose to extract our
texts from the B1 level, assuming they would contain
a higher number of morphosyntactic errors than texts
from higher proficiency levels. To make
the annotation task easier, we divided each text into
sentences. Details about the sentence sample can be
found in Table 1.
        </p>
        <table-wrap id="tab1">
          <label>Table 1</label>
          <caption><p>Description of the sentence sample.</p></caption>
          <table>
            <tbody>
              <tr><td>Total number of sentences</td><td>67</td></tr>
              <tr><td>Range of number of sentences in each text</td><td>5-7</td></tr>
              <tr><td>Average and range of sentence length (in tokens)</td><td/></tr>
            </tbody>
          </table>
        </table-wrap>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Error type identification</title>
        <p>
          Contrary to previous studies
          <xref ref-type="bibr" rid="ref13 ref2">(Ng et al. 2014; Bryant et al.
2019)</xref>
          that employed a broad notion of error, we focused
only on specific morphosyntactic errors (selection (S),
addition (A), omission (O), ending (E)) within four Parts
of Speech (PoS: articles (A), prepositions (P), nouns (N),
verbs (V)), for a total of eight error types (Table 1). This
choice was due to the fact that Italian is a
morphologically rich language, and that the four
selected grammatical categories are a frequent source of
errors for learners.
        </p>
        <table-wrap id="tab-errors">
          <caption><p>The eight error types, combining part of speech and error category.</p></caption>
          <table>
            <thead><tr><th>Type</th><th>PoS</th><th>Error</th></tr></thead>
            <tbody>
              <tr><td>AS</td><td>article</td><td>selection</td></tr>
              <tr><td>AA</td><td>article</td><td>addition</td></tr>
              <tr><td>AO</td><td>article</td><td>omission</td></tr>
              <tr><td>PS</td><td>preposition</td><td>selection</td></tr>
              <tr><td>PA</td><td>preposition</td><td>addition</td></tr>
              <tr><td>PO</td><td>preposition</td><td>omission</td></tr>
              <tr><td>NE</td><td>noun</td><td>ending</td></tr>
              <tr><td>VE</td><td>verb</td><td>ending</td></tr>
            </tbody>
          </table>
        </table-wrap>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Annotation</title>
        <p>The outputs of the four LLMs were compared to a
benchmark (GS) obtained from the annotation of three
researchers. Three Italian trained linguists (i.e., the three
authors of this paper) manually annotated the sample
texts. The three researchers annotated only the error
types described above. Initially, agreement among the
three linguists was substantial (k = 0.61); they disagreed
mostly on the PA error (k = 0.39). Inter-annotator
disagreements were first resolved through negotiation
until at least a partial agreement (i.e., two annotators
out of three) was reached, which improved the overall
agreement (k = 0.81). The remaining cases with only
partial agreement were then resolved by prioritising the
majority decision over the dissenting annotator, reaching
perfect agreement (k = 1).</p>
<p>In the GS, 47 grammatical errors were identified, with
an average of 4 errors per text; no errors were found in 32
sentences. On average, each sentence contained 2 errors.</p>
<p>
ChatGPT-4o (July 2024 version), Copilot, Gemini and
Llama3 were evaluated. Several steps were followed to
arrive at the final prompt, which can be found in
Appendix A. We started by giving the prompt in Italian and
presenting all the texts together. However, the four
LLMs, used off the shelf without any task-specific
training, were able to find only a small number of errors.
We then proposed the prompt in Italian again, repeating
the instructions for each text. In this case, the LLMs
identified types of errors that were not required.
Following the recommendations of Kruijsbergen et al.
(2024) on the prompt's language, the entire prompt was
then given in English. The performance improved, as a
greater number of errors were identified, but error types
that were not required were still reported. Therefore, we
gave a more detailed prompt in English, following the
recommendation of Coyne et al.
(2023). Definitions of the four Italian PoS were provided.
Further, we listed the eight error types with descriptions
and examples. The texts were presented in numbered
sentences. LLMs were instructed to classify each
detected error and were informed that there could be
more than one error in a sentence as well as no errors at
all. The entire prompt was repeated for each text. This
last version of the prompt was used for this study.
Subsequently, we calculated the inter-annotator
agreement between the four LLMs, which turned out to be
weak (k = 0.21).</p>
<p><bold>3.3.2. Human annotator groups</bold></p>
        <p>LLMs’ performance was also compared to that of two
human groups. Twenty-two L1 speakers (age range: 19-50) and
twenty-seven L2 speakers (age range: 22-40) of Italian
took part in the annotation task. They were
undergraduate and postgraduate students in the humanities
and social sciences. They were asked to annotate only the
error types described above, with a definition and
examples provided for each error type. They were also
asked to report the incorrect form and to provide the
correct one. We then calculated the inter-annotator
agreement among the raters of each group. L1
speakers reached good agreement (k = 0.52), while
agreement among L2 speakers was poor (k = 0.33).</p>
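<p>The k values reported above are chance-corrected agreement coefficients; for three or more annotators, agreement of this kind is commonly computed as Fleiss’ kappa. The sketch below is illustrative only: it assumes the Fleiss variant and a per-item category-count representation, as the paper does not specify its implementation.</p>

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a fixed number of raters over a list of items.

    counts[i][j] = number of raters assigning category j to item i;
    every row must sum to the same number of raters.
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    n_cats = len(counts[0])
    total = n_items * n_raters
    # Marginal proportion of assignments falling in each category.
    p_j = [sum(row[j] for row in counts) / total for j in range(n_cats)]
    # Observed pairwise agreement per item, averaged over items.
    p_i = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
           for row in counts]
    p_bar = sum(p_i) / n_items
    # Expected agreement by chance.
    p_e = sum(p * p for p in p_j)
    return (p_bar - p_e) / (1 - p_e)

# Three annotators, four items, two categories ("error" / "no error"):
# unanimous on three items, split 2-1 on the fourth.
ratings = [[3, 0], [0, 3], [3, 0], [2, 1]]
kappa = fleiss_kappa(ratings)  # 0.625
```

<p>On this scale, k = 1 indicates perfect agreement and values near 0 indicate chance-level agreement, which is how the k values above are interpreted.</p>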
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Evaluation</title>
      <p>
        Four measures were used to compare the performance
of LLMs and human annotators in detecting errors:
Accuracy, Precision (P), Recall (R) and F-score (Fß).
Accuracy was calculated by dividing the number of
correctly identified errors by the total number of
annotated errors. To be consistent with previous works
in GED
        <xref ref-type="bibr" rid="ref18">(Volodina et al. 2023)</xref>
, ß was set to 0.5, so that the F-score weights
P twice as much as R (i.e., it is more important that a
system makes a correct prediction than that it detects
all errors).
      </p>
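<p>As a concrete illustration of these measures, the sketch below scores a set of predicted error annotations against a gold standard. The pair representation (sentence id, error type) and the example counts are assumptions for illustration, not the study’s data:</p>

```python
def score(gold, predicted, beta=0.5):
    """Precision, Recall and F-beta for predicted vs gold-standard errors.

    gold and predicted are sets of (sentence_id, error_type) pairs,
    e.g. (3, "AO") for an article-omission error in sentence 3.
    """
    tp = len(gold.intersection(predicted))   # errors correctly detected
    fp = len(predicted.difference(gold))     # spurious detections
    fn = len(gold.difference(predicted))     # missed errors
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    b2 = beta * beta
    f = ((1 + b2) * precision * recall / (b2 * precision + recall)
         if precision + recall else 0.0)
    return precision, recall, f

# Illustrative annotations only.
gold = {(1, "AO"), (2, "VE"), (2, "PS")}
pred = {(1, "AO"), (2, "VE"), (3, "AA")}
p, r, f = score(gold, pred)
```

<p>With ß = 0.5 the F-score rewards precise systems: halving recall costs less than halving precision.</p>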
      <sec id="sec-4-1">
        <title>4.1. Overall error detection</title>
        <p>Gemini outperformed the other three systems,
demonstrating the highest accuracy (65.52%). In
contrast, Llama3 turned out to be the least accurate
(51.72%). ChatGPT and Copilot
behaved similarly in terms of accuracy (57.47%). LLMs
were less accurate than the human groups in detecting
errors, as L1 and L2 speakers reached much higher
accuracy values (89.66% and 78.16% respectively).</p>
        <p>When looking at AI performance, Copilot and
Llama3 showed worse P than ChatGPT and Gemini,
indicating that they had low ability in detecting true
error instances. Conversely, Gemini and Copilot were
able to detect a higher number of errors compared to
ChatGPT and Llama3. ChatGPT made the best
predictions, while Gemini had better R. Human groups
outperformed AI systems for R, P, and F-score (Table 2).
L1 speakers were able to detect almost all errors and to
make correct predictions. L2 speakers, by contrast, had
better P but worse R, suggesting a low number of false
positives (FP) but a reduced ability to detect true
positives (TP).</p>
        <p>Figure 1 shows the performance of each group in
terms of P and R.</p>
        <table-wrap id="tab2">
          <label>Table 2</label>
          <caption><p>Precision, Recall and F-score of the four LLMs and the two human groups in overall error detection.</p></caption>
          <table>
            <thead><tr><th/><th>P (%)</th><th>R (%)</th><th>Fß (%)</th></tr></thead>
            <tbody>
              <tr><td>ChatGPT</td><td>65.22</td><td>58.82</td><td>63.83</td></tr>
              <tr><td>Copilot</td><td>34.78</td><td>69.56</td><td>66.75</td></tr>
              <tr><td>Gemini</td><td>58.69</td><td>71.05</td><td>60.81</td></tr>
              <tr><td>Llama3</td><td>45.65</td><td>55.26</td><td>47.29</td></tr>
              <tr><td>L1 speakers</td><td>93.02</td><td>89.96</td><td>92.39</td></tr>
              <tr><td>L2 speakers</td><td>93.55</td><td>63.04</td><td>85.29</td></tr>
            </tbody>
          </table>
        </table-wrap>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Error type detection</title>
        <p>To examine thoroughly the performance of the LLMs in
GED, we calculated R, P and F-score metrics for each of
the eight error types (Table 3).</p>
        <p>[Table 3: P (%), R (%) and Fß (%) for each of the eight error types (AO, AS, AA, NE, VE, PO, PA, PS), reported separately for ChatGPT, Copilot, Gemini, Llama3 and the L1 and L2 speaker groups.]</p>
        <p>Copilot, Gemini, and Llama3 failed to detect various
error types, exhibiting a high number of FP without
detecting true instances. Copilot showed fair
prediction of VE and PS errors. Gemini had better R and
P in detecting and correctly predicting AO and VE
errors, but performed worse on PS errors in
terms of both P and R. Llama3 was able to predict AS,
VE, and PS errors, but showed low values of R. ChatGPT
turned out to be the best at predicting all error types,
except for the AA error: it showed high values
of P in the prediction of AO, PA, and PO errors, and
low values of P and R for PS errors.</p>
        <p>Human groups performed better than LLMs in
detecting each error type. L1 speakers exhibited high
values of R and P in detecting all error types but
performed less well in making correct predictions on PS
errors. L2 speakers demonstrated better R and P in
detecting AO and AS errors. Conversely, they were
unable to identify all AA errors. Furthermore, they
showed a reduced ability in detecting all PO errors and
in predicting them correctly.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Discussion and conclusion</title>
      <p>The main aim of our paper was to investigate
whether AI can be a valid support for second language
acquisition research in learner error detection, with
specific reference to a language other than English, i.e.,
Italian. Our study compared the performance of four
LLMs with one another, as well as with that of L1 and L2
annotators. A GS, produced by the annotations of three
trained linguists, was adopted as the benchmark. Given the
richness of Italian morphosyntax and the variety of
possible morphosyntactic errors that L2 Italian learners
may produce, and contrary to the few other studies on
Italian, this study considered three different error types
for two of the parts of speech listed in Table 1, i.e. article
and preposition. This methodological novelty can
potentially lead to much more fine-grained results,
while counterbalancing, as in our case, the low number
of annotated texts.</p>
      <p>
        The general finding about human annotators
performing better than LLMs, both in terms of overall
error detection and in terms of error type detection, is
particularly significant if we consider the structural
differences between English and other languages.
Italian, like many other languages, is characterised by
rich morphosyntactic traits, which inevitably have a
considerable impact on the computational processing of
L1 and L2 texts. Our findings may thus be a reflection of
the well-known language bias in NLP, linked to the
dominance of English, which then leads to a number of
scientific but also social inequalities
        <xref ref-type="bibr" rid="ref16 ref18">(Søgaard 2022;
Volodina et al. 2023)</xref>
. Repeating the study with
pre-trained LLMs might improve their performance. At
present, pivotal tasks such as automatic error detection
and classification, performed on a morphologically rich
language such as Italian, do not seem to be viable with
LLMs, as they do not add effectiveness to the same task
performed manually. Future developments of this study
may also include fine-tuned models, which are generally
indicated as potentially better-performing than
non-tuned ones
        <xref ref-type="bibr" rid="ref11">(Kruijsbergen et al. 2024)</xref>
        , as well as an
increased number of annotated texts and an even more
fine-grained and extended error annotation scheme.
Automatic error detection and classification can be
crucial for both the development of online language
assessment systems and for second language acquisition
research as a whole. This is especially true for languages
other than English, which continue to be severely
underrepresented in all domains of language sciences,
including NLP.
      </p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgements</title>
      <p>This study was conducted in the context of CARLA –
Corpus Approaches to Research on Language, a research
group affiliated with the Department of Italian language,
literature and arts in the world (University for
Foreigners of Perugia, Italy).</p>
      <p>Appendix A
The Prompt
In this task, we present a text in Italian, produced by a
learner of L2 Italian at B1 proficiency level.
The text is numbered and divided into numbered
sentences. For each sentence, you will have to identify
specific errors, if any.</p>
      <p>The errors considered in this task involve articles (in
Italian "il, lo, la, i, gli, le, un, uno, una"), prepositions (in
Italian "di, a, da, in, con, su, per, tra, fra", in their simple
forms or combined with articles: "del, dalla, negli", etc.),
nouns, and verbs.</p>
      <p>For each error, you will have to indicate the type, which
you can choose from the following list:
1a: Article addition: the learner has added an article
where it was not necessary (e.g. "Ho fatto la fatica a
salire le scale": "la" should not have been used);
1b: Article omission: the learner did not use the article
even though it was necessary (e.g. "Maria ha fatto
compromesso con il suo capo": "un" should have been
used before "compromesso");
1c: Article choice: the learner used the wrong article (e.g.
"In montagna ci sono i alberi sempreverdi": "i" is wrong,
the correct article is "gli");
2: Verb ending: the verb ending is incorrect (e.g. "Ieri
Luca andavo al mare": "andavo" has the wrong ending
"o", the correct one is "a" --&gt; "andava");
3: Noun ending: the noun ending is incorrect (e.g. "Ho
comprato tre mela gialle": "mela" has the wrong ending
"a", the correct one is "e" --&gt; "mele");
4a: Preposition addition: the learner added a preposition
where it was not necessary (e.g. "Ho comprato a un
libro": "a" should not have been used);
4b: Preposition omission: the learner did not use a
preposition even though it was necessary (e.g. "Anna è
andata casa": the preposition "a" is missing before
"casa");
4c: Preposition choice: the learner used the wrong
preposition (e.g. "Questo è il libro a mio professore": "a"
is wrong, the right preposition was "del").</p>
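<p>For downstream scoring, the numeric codes used in the prompt can be mapped back onto the two-letter error labels used in Sections 3 and 4 (PoS initial plus error category). The mapping below is a sketch inferred from the definitions above; the helper name is illustrative:</p>

```python
# Prompt codes (1a-4c) mapped to the paper's two-letter error labels:
# first letter = PoS (A article, P preposition, N noun, V verb),
# second letter = error category (S selection, A addition, O omission, E ending).
PROMPT_CODE_TO_LABEL = {
    "1a": "AA",  # article addition
    "1b": "AO",  # article omission
    "1c": "AS",  # article choice (selection)
    "2":  "VE",  # verb ending
    "3":  "NE",  # noun ending
    "4a": "PA",  # preposition addition
    "4b": "PO",  # preposition omission
    "4c": "PS",  # preposition choice (selection)
}

def label_for(code):
    """Return the two-letter label for a prompt error code, e.g. '4b' -> 'PO'."""
    return PROMPT_CODE_TO_LABEL[code]
```

<p>Keeping this mapping in one place makes it straightforward to compare model answers, given in prompt codes, against gold-standard annotations expressed in the two-letter labels.</p>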
      <p>It is possible that there is more than one error in a
sentence, but also that there are no errors at all.
If you find no errors, do not indicate anything and move
on to the next sentence.</p>
      <p>Here is the text with the numbered sentences.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>D.</given-names>
            <surname>Biber</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Nekrasova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Horn</surname>
          </string-name>
          ,
          <article-title>The Effectiveness of Feedback for L1-English and L2-Writing Development: a Meta-Analysis</article-title>
          ,
          <source>ETS Research Report Series</source>
          (
          <year>2011</year>
          ),
          <volume>1</volume>
          ,
          <fpage>i</fpage>
          -
<lpage>99</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>C.</given-names>
            <surname>Bryant</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Felice</surname>
          </string-name>
          , Ø. E. Andersen, T. Briscoe,
          <article-title>The bea-2019 shared task on grammatical error correction</article-title>
          , in: H.
          <string-name>
            <surname>Yannakoudakis</surname>
          </string-name>
          et al. (Eds.),
          <source>Proceedings of the Fourteenth Workshop on Innovative Use of NLP for Building Educational Applications</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>52</fpage>
          -
          <lpage>75</lpage>
, doi: 10.18653/v1/W19-4406.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>C.</given-names>
            <surname>Bryant</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Felice</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Briscoe</surname>
          </string-name>
,
          <article-title>Automatic Annotation and Evaluation of Error Types for Grammatical Error Correction</article-title>
          , in: R. Barzilay, M.-Y. Kan (Eds.),
<source>Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>793</fpage>
          -
          <lpage>805</lpage>
, doi: 10.18653/v1/P17-1074.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>C.</given-names>
            <surname>Bryant</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yuan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. Reza</given-names>
            <surname>Qorib</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Cao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. Tou</given-names>
            <surname>Ng</surname>
          </string-name>
          , T. Briscoe,
          <article-title>Grammatical Error Correction: A Survey of the State of the Art</article-title>
,
<source>Computational Linguistics</source>
          (
          <year>2023</year>
          ),
          <volume>49</volume>
          (
          <issue>3</issue>
          ),
          <fpage>643</fpage>
          -
          <lpage>701</lpage>
, doi: 10.1162/coli_a_00478.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>S. P.</given-names>
            <surname>Corder</surname>
          </string-name>
,
<article-title>The Significance of Learners' Errors</article-title>
,
          <source>International Review of Applied Linguistics in Language Teaching</source>
          (
          <year>1967</year>
          ),
          <volume>5</volume>
          ,
          <fpage>161</fpage>
          -
          <lpage>170</lpage>
, doi: 10.1515/iral.1967.5.1-4.161.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>S.</given-names>
            <surname>Coyne</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Sakaguchi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Galvan-Sosa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zock</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Inui</surname>
          </string-name>
          ,
<article-title>Analyzing the performance of GPT-3.5 and GPT-4 in grammatical error correction</article-title>
(
<year>2023</year>
), arXiv:2303.14342.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>E.</given-names>
            <surname>Dagneaux</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Denness</surname>
          </string-name>
          , S. Granger,
<article-title>Computer-Aided Error Analysis</article-title>
,
<source>System</source>
          (
          <year>1998</year>
          ),
          <volume>26</volume>
          (
          <issue>2</issue>
          ),
          <fpage>163</fpage>
          -
          <lpage>174</lpage>
, doi: 10.1016/S0346-251X(98)00001-3.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>E.</given-names>
            <surname>Di Nuovo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mazzei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Bosco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sanguinetti</surname>
          </string-name>
          ,
          <article-title>Towards an Italian learner treebank in universal dependencies</article-title>
          , in: R. Bernardi et al. (Eds.),
          <source>CLiT: CEUR Workshop Proceedings (Volume: 2481)</source>
          ,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>E.</given-names>
            <surname>Di Nuovo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sanguinetti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mazzei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Corino</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Bosco</surname>
          </string-name>
,
<article-title>Valico-UD: Treebanking an Italian Learner Corpus in Universal Dependencies</article-title>
          ,
          <source>Italian Journal of Computational Linguistics</source>
          (
          <year>2022</year>
          ),
          <volume>8</volume>
          (
          <issue>1</issue>
          ), doi: 10.4000/ijcol.1007
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>N.</given-names>
            <surname>Dobrić</surname>
          </string-name>
          ,
          <article-title>Identifying errors in a learner corpus - the two stages of error location vs. error description and consequences for measuring and reporting inter-annotator agreement</article-title>
          ,
          <source>Applied Corpus Linguistics</source>
          (
          <year>2023</year>
          ),
          <volume>3</volume>
          (
          <issue>1</issue>
          ),
          <fpage>1</fpage>
          -
          <lpage>11</lpage>
          , doi: 10.1016/j.acorp.2022.100039.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>J.</given-names>
            <surname>Kruijsbergen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Van Geertruyen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Hoste</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>De Clercq</surname>
          </string-name>
          ,
          <article-title>Exploring LLMs' capabilities for error detection in Dutch L1 and L2 writing products</article-title>
          ,
          <source>Computational Linguistics in the Netherlands Journal</source>
          (
          <year>2024</year>
          ),
          <volume>13</volume>
          ,
          <fpage>173</fpage>
          -
          <lpage>191</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>C.</given-names>
            <surname>Leacock</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Chodorow</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Gamon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tetreault</surname>
          </string-name>
          ,
          <source>Automated Grammatical Error Detection for Language Learners</source>
          , Morgan &amp; Claypool,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>H. T.</given-names>
            <surname>Ng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. M.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Briscoe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Hadiwinoto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. H.</given-names>
            <surname>Susanto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Bryant</surname>
          </string-name>
          ,
          <article-title>The CoNLL-2014 shared task on grammatical error correction</article-title>
          , in: H. T. Ng,
          <string-name>
            <given-names>S. M.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Briscoe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Hadiwinoto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. H.</given-names>
            <surname>Susanto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Bryant</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the Eighteenth Conference on Computational Natural Language Learning: Shared Task</source>
          ,
          <year>2014</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>14</lpage>
          , doi: 10.3115/v1/W14-1701.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <article-title>Automated Essay Scoring and Revising Based on Open-Source Large Language Models</article-title>
          ,
          <source>IEEE Transactions on Learning Technologies</source>
          ,
          <year>2024</year>
          ,
          <volume>17</volume>
          , pp.
          <fpage>1920</fpage>
          -
          <lpage>1930</lpage>
          , doi: 10.1109/TLT.2024.3396873.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>A.</given-names>
            <surname>Søgaard</surname>
          </string-name>
          .
          <article-title>Should we ban English NLP for a year</article-title>
          ?, in:
          <string-name>
            <given-names>Y.</given-names>
            <surname>Goldberg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Kozareva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>5254</fpage>
          -
          <lpage>5260</lpage>
          , doi: 10.18653/v1/2022.emnlp-main.35.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>S.</given-names>
            <surname>Spina</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Fioravanti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Forti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Santucci</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Scerra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Zanda</surname>
          </string-name>
          ,
          <article-title>Il corpus CELI: una nuova risorsa per studiare l'acquisizione dell'italiano L2</article-title>
          ,
          <source>Italiano LinguaDue</source>
          (
          <year>2022</year>
          ),
          <volume>14</volume>
          (
          <issue>1</issue>
          ), pp.
          <fpage>116</fpage>
          -
          <lpage>138</lpage>
          , doi: 10.54103/2037-3597/1.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>S.</given-names>
            <surname>Spina</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Fioravanti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Forti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Zanda</surname>
          </string-name>
          ,
          <article-title>The CELI Corpus: design and linguistic annotation of a new online learner corpus</article-title>
          .
          <source>Second Language Research</source>
          (
          <year>2024</year>
          )
          <volume>40</volume>
          (
          <issue>2</issue>
          ),
          <fpage>457</fpage>
          -
          <lpage>477</lpage>
          , doi: 10.1177/02676583231176370.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>E.</given-names>
            <surname>Volodina</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Bryant</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Caines</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>De Clercq</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-C.</given-names>
            <surname>Frey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Ershova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rosen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Vinogradova</surname>
          </string-name>
          ,
          <article-title>MultiGED-2023 shared task at NLP4CALL: Multilingual Grammatical Error Detection</article-title>
          , in:
          <string-name>
            <given-names>D.</given-names>
            <surname>Alfter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Volodina</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>François</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Jönsson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Rennes</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the 12th Workshop on Natural Language Processing for Computer Assisted Language Learning (NLP4CALL 2023)</source>
          ,
          <year>2023</year>
          ,
          <fpage>1</fpage>
          -
          <lpage>16</lpage>
          , https://aclanthology.org/2023.nlp4call-1.1.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>