<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>LEARN: on the feasibility of Learner Error AutoRegressive Neural annotation</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Paolo Gajo</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Daniele Polizzi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Adriano Ferraresi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alberto Barrón-Cedeño</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Università di Bologna</institution>
          ,
          <addr-line>Corso della Repubblica, 136, 47121, Forlì</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>Error annotation is a defining feature of learner corpora, essential for understanding second-language development. Its centrality is mirrored by the meticulous effort required for its implementation, which is typically conducted in manual fashion. In this exploratory study, we investigate the feasibility of automating the task by training large language models (LLMs) in the context of dialogue-based Computer-Assisted Language Learning (CALL). We experiment with instruction-tuned LLMs across annotation granularities and prompting strategies. Results show that coarse-grained tags are more reliably predicted than fine-grained ones, with few-shot example-based prompting outperforming context-only formats. These findings point to the potential of LLMs for semi-automatic error annotation, while underscoring the need for larger datasets and the effectiveness of training models through causal LM to handle rare linguistic phenomena. Code and data: https://github.com/paolo-gajo/LEARN</p>
      </abstract>
      <kwd-group>
        <kwd>large language models</kwd>
        <kwd>low-rank adaptation</kwd>
        <kwd>error annotation</kwd>
        <kwd>learner corpora</kwd>
        <kwd>human-computer interaction</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Error annotation plays a crucial role in learner corpus research, a domain of inquiry that, while closely related to second language acquisition (SLA), is distinguished by its focus on providing insights into learners' interlanguage systems and acquisition patterns. The underlying assumption is that errors, defined as the application of an internalised rule not prescribed by established linguistic norms [1], are not merely indicators of textual quality, but a reflection of learners' evolving competence in their target language [2].</p>
      <p>Regardless of the taxonomy's level of granularity, error annotation remains a time-consuming task, susceptible to inconsistencies in human judgment and inaccuracies from automatic parsers originally designed for native input [3]. As generative AI architectures begin to populate linguistic toolkits [4] and mimic established approaches to language analysis [5], an opportunity arises to reduce the burden of manual annotation while retaining the depth of linguistic insight traditionally required for this complex task. While a limited number of studies do investigate the use of the technology to annotate pragmatic and discourse-level features, including [6] on apologetic expressions and [7] on evaluative stance, its applications in the context of learner corpus research remain scarce.</p>
      <p>To address this issue, we investigate the feasibility of training large language models (LLMs) to automate error annotation, establishing a baseline for comparison while focusing on an increasingly relevant mode of text production: human-computer interactions [8]. The task proves particularly challenging due to the complexity of the tagset adopted, the model's limited domain-specific expertise, and the scarcity of annotated training data available. Our contributions are two-fold: (i) we release a novel dataset containing 2,675 manual annotations of linguistic errors across fifty texts; (ii) using LoRA-tuned LLMs, we assess the impact of four combinations of prompting strategies on automatic error annotation in human-computer written interactions, establishing a benchmark for future work in the area.</p>
      <p>The rest of the paper is structured as follows: Section 2 outlines the role of learner corpora in SLA research, with a focus on error annotation practices. Section 3 introduces the dataset and the tagset used in the experiments, along with a description of the annotation process. Section 4 provides specifics on the model architecture, training, and evaluation. Section 5 lays out the settings approached for the automatic annotation task. Section 6 reports the results of the experiments. Finally, Section 7 draws conclusions and offers suggestions on future research avenues. In Appendix A, we provide a full list of the used categories and tags. Appendix B reports the full results. Appendix C provides information on the used computational resources.</p>
      <p>CLiC-it 2025: Eleventh Italian Conference on Computational Linguistics, September 24–26, 2025, Cagliari, Italy. * Corresponding author. paolo.gajo2@unibo.it (P. Gajo); daniele.polizzi2@unibo.it (D. Polizzi); adriano.ferraresi@unibo.it (A. Ferraresi); a.barron@unibo.it (A. Barrón-Cedeño). https://www.unibo.it/sitoweb/paolo.gajo2 (P. Gajo); https://www.unibo.it/sitoweb/daniele.polizzi2 (D. Polizzi); https://www.unibo.it/sitoweb/adriano.ferraresi/cv-en (A. Ferraresi); https://www.unibo.it/sitoweb/a.barron (A. Barrón-Cedeño). ORCID: 0009-0009-9372-3323 (P. Gajo); 0009-0007-1927-4158 (D. Polizzi); 0000-0002-6957-0605 (A. Ferraresi); 0000-0003-4719-3420 (A. Barrón-Cedeño). © 2025 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).</p>
    </sec>
    <sec id="sec-2">
      <title>2. Background and Motivation</title>
      <sec id="sec-2-1">
        <title>This challenge is not just one of scale, but also of</title>
        <p>
          scope. Learner corpora are still predominantly focused
Learner corpora are systematic collections of electronic on argumentative or academic writing, mirroring the
texts whose key defining feature lies in the representation types of structured tasks performed in traditional
eduof “language as produced by foreign or second language cational settings. Interactive language use, by contrast,
(L2) learners” [9]. They are increasingly used in various remains significantly underrepresented and tied to
semistrands of empirical SLA research, varying across multi- structured interview formats [
          <xref ref-type="bibr" rid="ref1">13</xref>
          ], which only partially
ple dimensions: medium (spoken or written), genre (such capture the dynamic and co-constructed nature of
realas essays, summaries and interviews), learners’ linguistic time communication. This gap is particularly
problembackground, sampling strategies (synchronic, longitudi- atic given the centrality of interactionist approaches to
nal or quasi-longitudinal), intended pedagogical or re- SLA, which emphasise the role of input, opportunity for
search purpose, and geographical scope of data collection output, feedback, and negotiation of meaning in driving
(ranging from local to large-scale initiatives) [9]. Each of acquisition [14]. As Granger [15] forecasts, the future
these design parameters shapes the corpus analytical po- of learner corpus research lies not only in enhancing
tential and determines its suitability for diferent lines of annotation practices but also in expanding corpora to
linguistic inquiry, particularly those aimed at identifying new educational contexts, each potentially introducing
developmental trajectories and persistent learner dificul- distinct patterns of learner language that call for targeted
ties [10]. Their structured format also makes them a valu- annotation strategies.
able resource for the development of natural language Shifts towards greater variability in learner data
amprocessing (NLP) applications grounded in authentic data plify the need for scalable, adaptive annotation methods.
that are used for educational purposes [
          <xref ref-type="bibr" rid="ref29">11</xref>
          ]. Our contribution presents an exploratory case study
in
        </p>
        <p>Central to all of these applications is the identification vestigating whether small-scale, open-weight LLMs can
and classification of errors, which serve not only as indi- reliably be trained to automate learner error annotation,
cators of language proficiency but also as windows into evaluating not only their diagnostic capabilities but also
the evolving interlanguage systems of learners. These their alignment with linguistic taxonomies and
estaberrors are signalled using a predefined taxonomy that lished error annotation conventions. More specifically,
serves the purpose of assigning tags, i.e. labels captur- we test this feasibility in an unconventional setting for
ing specific categories and subcategories of errors, to the learner corpora annotation: informal dialogue practice.
corresponding portion of text. To ensure consistency,
annotation typically follows detailed guidelines, which
provide operational definitions and prototypical cases for 3. Data
each tag. However, the process still requires annotators
to formulate a hypothesis about the nature of each error, The dataset employed contains human–machine written
interpreting the distance between the learner’s produc- interaction data, contributing to an increasingly relevant
tion and the expected target form as either structural or research strand focusing on conversational AI’s
efectivelinguistic per se [2]. ness for language development [14]. It features
English</p>
        <p>In spite of the subjectivity inherently built into the as-foreign-language (EFL) productions of Italian
univertask, expert judgment has so far ofered the most reliable sity students aged 18–25 from diverse degree programs,
means of ensuring both consistency and linguistic accu- most of whom self-report a low-to upper-intermediate
racy, striking a delicate balance between introspection proficiency level. One distinct interaction for each
stuand methodological rigour that underpins high-quality dent (50 in total) was collected based on a protocol
comlearner corpus annotation. While projects like the Cam- bining one of two diferent LLM-based chatbots with two
bridge Learner Corpus (CLC)1 and the International Cor- EFL learning scenarios. The chatbots used during the
pus of Learner English (ICLE)2 have demonstrated the experimental sessions are ChatGPT,3 a general-purpose
value of error-tagged data for SLA research, annotation Generative AI tool, and Pi.ai,4 a task-oriented chatbot
remains labour-intensive and demands substantial ex- specifically developed to engage in natural language
conpertise and time investment. The existence of automatic versation. The learning scenarios are structured around
approaches to learner corpus error annotation, by con- two communicative formats that constitute part of
stantrast, remains largely limited. Although some research dardised English proficiency tests: open-ended
conversahas investigated advanced technologies such as LLMs for tion (small talk) and target-oriented dialogue (role play).
grammatical error identification [ 12], to the best of our While small talk allows participants to freely express
knowledge no published work has explored their capacity themselves on past experiences, current interests and
to perform full-fledged annotation of learner language. events or future projects, role playing requires them to</p>
      </sec>
      <sec id="sec-2-2">
        <title>1https://www.cambridge.org/elt/corpus/learner_corpus2.htm</title>
      </sec>
      <sec id="sec-2-3">
        <title>2https://www.uclouvain.be/en/research-institutes/ilc/cecl/icle</title>
      </sec>
      <sec id="sec-2-4">
        <title>3https://chatgpt.com/</title>
      </sec>
      <sec id="sec-2-5">
        <title>4https://pi.ai/talk</title>
        <p>Source Token Count and calques have been assigned a distinct subcategory
Learner-Produced (total) 17,730 (LWCO) falling within that of lexis (L) rather than form</p>
        <sec id="sec-2-5-1">
          <title>Small talk 10,548 (F). The rationale behind this change follows on Cervini</title>
          <p>ChaRtobloetp-Glaeynerated (total) 957,,312802 and Paone’s [17] classification of intercomprehension</p>
        </sec>
        <sec id="sec-2-5-2">
          <title>Small talk 39,033 strategies, where both calques and neologisms are con</title>
        </sec>
      </sec>
      <sec id="sec-2-6">
        <title>Role play 56,287 ceived as pertaining to the lexical dimension of commu</title>
        <p>Total 113,901 nication. The remaining macro-categories are retained
as originally defined [ 16]. Grammatical Errors (G) are
Table 1 violations of standard grammar rules that afect
syntacDataset token distribution by task type. tic structure, including subject–verb agreement, misuse
of tenses, article errors, or problems with word forms,
such as pronouns and determiners. Lexico-Grammatical
use context-sensitive vocabulary and formulaic language. Errors (X) involve combination patterns specific to the
As such, both tasks prove particularly efective in cover- word rather than sentence-wide grammar, including
deing a wide variety of use cases where multiple examples pendent prepositions or verb complementations. Lexical
of errors might appear, ranging from grammar and lexis Errors (L) concern vocabulary choices that do not match
to register and style. The dataset annotation scheme the intended meaning or context, hence coming across
features structural information on turns and contextual as semantically awkward or stylistically inappropriate.
information on the chatbot used, the tasks performed and Word Errors (W) target imbalances in a sentence caused
the learner profile. Token counts are reported in Table 1. by omitting necessary words, adding superfluous ones,
or placing words in an unnatural or incorrect order.
Punc3.1. Tagset tuation Errors (Q) cover incorrect, missing, or excessive
use of marks, such as commas, periods, or colons. Finally,
Our benchmark for automatic error identification con- Infelicities (Z) address stylistic concerns that, while not
sists of fifty texts manually annotated by two expert strictly errors, may require reformulation for the sake
anglicists, using an adapted version of the Louvain Error of clarity or naturalness (Z). See Table 8 in Appendix A
Tagging Manual Version 2.0 [16]. While the taxonomy for a complete list of the tags used, together with a brief
does not align with any specific formal SLA theory or description of their coverage for each use case.
L1–L2 pairing, it was selected precisely for its broad Errors were marked using inline XML-style tags
recognition within the learner corpus research commu- of the format &lt;TAG corr="correction"&gt;incorrect
nity, a de facto standard providing a comprehensive map- text&lt;/TAG&gt; via the Université Catholique de Louvain
ping of errors discussed in the field. The adaptation was Error Tagging Editor (UCLEE).5 In case of the addition
carried out through preliminary pilot tests and includes of missing words or the omission of redundant ones,
several fine-tuning operations that introduce revised use the format is &lt;TAG corr="correction"&gt;\0&lt;/TAG&gt;
cases and five new tags. The updated manual comprises or &lt;TAG corr="\0"&gt;incorrect text&lt;/TAG&gt;, respectively.
59 categories, spanning across eight domains: digitally- The software supports the insertion, editing and
processmediated communication (DMC), form (F), punctuation ing of error tags using a preferred tagset. To
accommo(Q), grammar (G), lexico-grammar (X), lexis (L), word date the specific requirements of our task, we uploaded a
(W), infelicities (Z) and code-switching (CS). custom .tag file reflecting the necessary modifications we</p>
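        <p>To make the annotation format concrete, the following minimal sketch (illustrative only, not the authors' released code) extracts (tag, erroneous span, correction) triples from an annotated sentence; nested tags, which the guidelines allow, are not handled here.</p>
        <preformat>
import re

# Matches &lt;TAG corr="correction"&gt;incorrect text&lt;/TAG&gt;; "\0" marks empty spans.
TAG_RE = re.compile(
    r'&lt;(?P&lt;tag&gt;[A-Z]+) corr="(?P&lt;corr&gt;[^"]*)"&gt;(?P&lt;span&gt;.*?)&lt;/(?P=tag)&gt;'
)

def parse_annotations(annotated: str) -&gt; list[tuple[str, str, str]]:
    """Return (tag, erroneous_span, correction) triples from one sentence."""
    return [(m["tag"], m["span"], m["corr"]) for m in TAG_RE.finditer(annotated)]

example = ('The food is not very good in &lt;DMCC corr="Spain"&gt;spain&lt;/DMCC&gt; '
           '&lt;LCC corr="\\0"&gt;and&lt;/LCC&gt; but the &lt;FS corr="atmosphere"&gt;atmophere&lt;/FS&gt;')
print(parse_annotations(example))
# [('DMCC', 'spain', 'Spain'), ('LCC', 'and', '\\0'), ('FS', 'atmophere', 'atmosphere')]
        </preformat>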
        <p>In line with the Louvain Manual, corrections were minimal and hypothesis-driven, ensuring that tags reflect plausible learner intentions and do not result in speculative rewriting of the original text. Tags were assigned based on the erroneous form itself, using the shortest possible span required to isolate it. Regional spelling variants (e.g., British and American English) were not flagged, as participants received no instruction on preferred norms. Likewise, punctuation errors were annotated only when they hindered readability, in recognition of informal communication habits. Cases where multiple errors overlapped were nested within one another, with spelling errors being considered the lowest level, i.e. the first correction to be applied.</p>
        <p>Figure 1: XML annotation output of the UCLEE software.</p>
        <preformat>
&lt;?xml version="1.0" encoding="utf-8"?&gt;
&lt;file name="id_1.txt" tagset="uclee-en-2.0.tag"&gt;
&lt;text id="id_1" area_of_study="Social sciences" age="24" [...]&gt;
&lt;task type="small talk"&gt;
&lt;turn type="chatbot" who="Pi.ai"&gt;Hey there, great to meet you.
I'm Pi, your personal AI. [...]&lt;/turn&gt;
&lt;turn type="student"&gt;Hi&lt;/turn&gt;
&lt;turn type="chatbot" who="Pi.ai"&gt;Hey User!
How's everything going on your side? [...]&lt;/turn&gt;
&lt;turn type="student"&gt;&lt;DMCC corr="How"&gt;how&lt;/DMCC&gt; are you today?&lt;/turn&gt;
[...]
&lt;/task&gt;
&lt;task type="role play"&gt;
&lt;turn type="student"&gt;&lt;DMCC corr="You"&gt;you&lt;/DMCC&gt; are an encouraging tutor
who helps students improve their &lt;DMCC corr="English"&gt;english&lt;/DMCC&gt; by
engaging in role play &lt;FS corr="activities"&gt;actvities&lt;/FS&gt;. [...]&lt;/turn&gt;
&lt;turn type="chatbot" who="Pi.ai"&gt;Great idea! Let's start the role play.
As the Restorative Justice, I'm interested in [...]&lt;/turn&gt;
[...]
&lt;/task&gt;
&lt;/text&gt;
&lt;/file&gt;
        </preformat>
        <p>Inter-annotator agreement (IAA) was calculated on five separate texts using the Gamma coefficient [18], a metric suited to evaluating categorical labels with overlapping text spans. Annotation files were first parsed to extract error tags and their corresponding character offsets using a custom XML processing function. The agreement was recorded only when annotators applied the same error tag to mark the exact same character span as erroneous. Scores registered a mean of 0.77024 ± 0.09270. The computation was repeated a second time on all tags except those targeting formal spelling (FS) and digitally-mediated communication (DMC), that is, taking into account only the most subjective among the sub-categories in our tagset; FS and DMC alone account for 53.60% of all the tagged issues. The results show an agreement of 0.74698 ± 0.13027. Given the strictness of our criteria, we consider the obtained IAA to be highly satisfactory and reliable, since γ &lt; 0 signifies worse-than-random agreement and the upper bound is γ = 1.</p>
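        <p>A computation of this kind can be reproduced, under assumptions about file parsing, with the pygamma-agreement package, which implements Mathet et al.'s γ [18]; the annotator names, offsets, and tags below are illustrative only, not our actual data.</p>
        <preformat>
from pyannote.core import Segment
from pygamma_agreement import Continuum

continuum = Continuum()
# One unit per error tag: annotator id, character-offset span, tag label.
for annotator, start, end, tag in [
    ("ann1", 0, 3, "DMCC"), ("ann2", 0, 3, "DMCC"),   # exact agreement
    ("ann1", 10, 19, "FS"), ("ann2", 10, 21, "FS"),   # overlapping spans
]:
    continuum.add(annotator, Segment(start, end), tag)

# gamma &lt; 0 signals worse-than-random agreement; the upper bound is 1.
gamma = continuum.compute_gamma().gamma
print(f"gamma = {gamma:.3f}")
        </preformat>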
      <sec id="sec-2-7">
        <title>The data are compiled by filtering out the chatbot re</title>
        <p>sponses and splitting the collection into training,
development, and testing partitions with an 80/10/10 split.</p>
      </sec>
      <sec id="sec-2-8">
        <title>Five diferent (fixed) seeds are used to split the data and</title>
        <p>initialise model states, which helps us mitigate variance
in the results. Table 2 provides information on the
distribution of the tags, which has a long tail formed by rare
tags, 22 of which have fewer than 10 occurrences.</p>
      </sec>
      <sec id="sec-2-9">
        <title>As exemplified in Figure 2, we experiment with two</title>
        <p>types of in-context learning (ICL) sections (bottom row),
Figure 1: XML annotation output of the UCLEE software. each using fine- or coarse-grained tags (top row), for a
total of four prompt combinations. The prompt starts with
Table 2 a system message defining the LLM persona, followed
Distribution of the tags in the data used for training, develop- by the instruction. The macro categories or tags are then
ment, and testing. optionally listed. In the first experimental setting, a
varying number of ICL examples is included. For all data
Tag # splits, pairs of examples are sampled at random solely
DMCC 927 LP 45 GDO 13 XNCO 4 from the training set, across any of the student-chatbot
FGSA 134194 LLSSVN 4435 XQNRUC 1122 XGAPDDJCO 34 conversations. We sample an equal number of examples
LSPR 80 CSINTRA 33 CSINTER 11 LCC 3 with and without error annotations.6 Finally, the task is</p>
      </sec>
      <sec id="sec-2-10">
        <title>GNN 80 GVN 32 GPI 10 LCLC 3 repeated to mark the target sentence.</title>
        <p>GPP 72 XVPR 28 GADVO 9 GADJO 2 In the second setting, we provide the model with the
GWVOT 6643 GQNCC 2274 GGDDTI 88 XGPPRUCO 22 context of the conversation to which the target message
QM 60 GVNF 23 GADJCS 7 GPO 2 belongs. Note that in this case, what we divide in 80/20/20
Z 54 DMCA 23 XNPR 7 XADVPR 1 splits is the list of conversations, rather than the
individLXWVCCOO 5512 GGPVRM 2108 GQDLD 66 LGCPLFS 11 ual messages. Since conversations do not all have the
WM 51 LSADV 16 FM 5 same size, in this case each seed produces diferent split</p>
      </sec>
      <sec id="sec-2-11">
        <title>GVAUX 51 GWC 15 XADJPR 5 sizes, as shown in Table 3.</title>
        <p>WR 49 LSADJ 15 LCS 4 In our experiments, we wish to showcase the impact
of using random annotated instances vs unannotated
context. Therefore, although the data partitions used in
spelling errors being considered the lowest level, i.e. the the two settings are produced in diferent ways, we still
ifrst correction to be applied. deem our approach to be valid, considering the use of</p>
        <sec id="sec-2-11-1">
          <title>Inter-annotator agreement (IAA) was calculated on five diferent seeds.</title>
          <p>
            ifve separate texts using the Gamma coeficient [
            <xref ref-type="bibr" rid="ref18">18</xref>
            ], a
metric suited to evaluating categorical labels with
overlapping text spans. Annotation files were first parsed
to extract error tags and their corresponding
character ofsets using a custom XML processing function.
          </p>
        </sec>
      </sec>
      <sec id="sec-2-12">
        <title>The agreement was recorded only when annotators ap</title>
      </sec>
      <sec id="sec-2-13">
        <title>6The original and the annotated utterances are separated by ###</title>
        <p>symbols to avoid any subwords being merged with the separator
by the used tokenisers.</p>
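        <p>The following sketch shows how a random-sampling ICL prompt of this kind can be assembled (the function and field names are assumptions for illustration, not the repository's actual interface):</p>
        <preformat>
import random

SEP = "###"  # keeps original and annotated utterances apart for the tokeniser

def build_prompt(train_examples, target, tag_listing=None, icl=6):
    """icl examples with error annotations and icl examples without."""
    pos = [e for e in train_examples if e["has_errors"]]
    neg = [e for e in train_examples if not e["has_errors"]]
    demos = random.sample(pos, icl) + random.sample(neg, icl)
    random.shuffle(demos)
    lines = ["You are an AI specialized in the task of annotating grammatical errors.",
             "Annotate the target sentence below with the following tags, in XML style."]
    if tag_listing:  # optional coarse- or fine-grained tag descriptions
        lines += tag_listing
    lines.append("Below are reference examples:")
    lines += [f"{d['original']}{SEP}{d['annotated']}" for d in demos]
    lines.append("Annotate the following target sentence, without providing any explanation:")
    lines.append(f"{target}{SEP}")
    return "\n".join(lines)
        </preformat>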
        <p>In the second setting, we provide the model with the context of the conversation to which the target message belongs. Note that in this case, what we divide in 80/10/10 splits is the list of conversations, rather than the individual messages. Since conversations do not all have the same size, in this case each seed produces different split sizes, as shown in Table 3.</p>
        <p>In our experiments, we wish to showcase the impact of using random annotated instances vs unannotated context. Therefore, although the data partitions used in the two settings are produced in different ways, we still deem our approach to be valid, considering the use of five different seeds.</p>
        <p>Figure 2: Structure of the prompts: system instruction, optional tag listing, and either randomly sampled ICL example pairs or the preceding chat messages, followed by the target sentence.</p>
        <preformat>
You are an AI specialized in the task of annotating grammatical errors.
Annotate the target sentence below with the following tags, in XML style.
Reproduce the full sentence and annotate each error.
The following are the tags you should use for annotation:
&lt;DMCC&gt;: Capitalization issues. [...]
&lt;WO&gt;: Errors in word order.

Code-Switching: use of L1 (native language). [...]
Infelicities: stylistic concerns (not strictly errors).

Below are reference examples:
Everything is going fine. How are you?###Everything is going fine. How are you? [...]
The food is not very good in spain and but the atmophere Is fantastic###The food
is not very good in &lt;DMCC corr="Spain"&gt;spain&lt;/DMCC&gt; &lt;LCC corr="\0"&gt;and&lt;/LCC&gt; but
the &lt;FS corr="atmosphere"&gt;atmophere&lt;/FS&gt; &lt;DMCC corr="is"&gt;Is&lt;/DMCC&gt; fantastic

Below are the chat messages preceding the target sentence:
Pi.ai: Hey there, great to meet you. I'm Pi, [...]
student: Hi pi can we do a roleplay to help me practice my english?
Pi.ai: Absolutely, User! Role-playing can be a great way [...]
student: I would like to do a customer service scenario
Pi.ai: Sure thing! Let's start the [...]

Annotate the following target sentence, without providing any explanation:
Yes please, I would like a bottle of water and a glass of wine###
        </preformat>
        <p>Table 3: Split sizes for the training, development, and testing partitions, for the random ICL sampling and context prompt settings.</p>
        <preformat>
Setting                 Train   Dev   Test
Random ICL sampling       831   104    104
        </preformat>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Model</title>
      <p>We fine-tune both models through causal language modeling over the full prompt,8 since we want the model to learn to predict the annotated sentences not just from the target sentence, but also from the tags and the examples included in the prompt. In other words, we simultaneously train the model on a large amount of sampled examples within the prompt, through teacher forcing, and we also instruction-tune it to predict the desired target sentence.</p>
      <p>8 The model needs to be given the prompt in a chat template (https://huggingface.co/docs/transformers/en/chat_templating#applychattemplate), which we omit here for clarity.</p>
      <p>The architecture of these models consists in a token/positional embedding layer, followed by a stack of decoders, with a language modeling classifier on top. Each decoder comprises a grouped-query attention layer [21], followed by a set of MLP layers each using a SwiGLU activation function [22]. We update the weights of the decoder blocks with LoRA [23], only targeting the key, query, and value matrices K, Q, V of the attention layers:</p>
      <p>Attention(Q, K, V) = Softmax((QK⊤ + M) / √d_k) V,</p>
      <p>where M is the matrix filled with zero values in the lower triangular part and −∞ elsewhere, and d_k is the output dimension of Q and K. The attention and MLP layer parameters are kept frozen during training. The original input to these layers is simultaneously processed through LoRA components consisting of weight matrices A ∈ R^(d_1 × r) and B ∈ R^(r × d_2), where r ≪ d_1, d_2 represents the low-rank projection dimension, while d_1 and d_2 correspond to the input and output dimensions of each respective layer. During training, only the LoRA matrices A and B receive parameter updates. Thus, the forward pass of an input x through a layer with frozen weight W_0 is modified as:</p>
      <p>h = x W_0 + x A B.</p>
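      <p>In practice, this configuration corresponds to attaching LoRA adapters to the attention projections, e.g. via Hugging Face peft; the sketch below is a minimal illustration, with placeholder rank and scaling values rather than the paper's hyperparameters.</p>
      <preformat>
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
lora_cfg = LoraConfig(
    r=16,                                           # low-rank dimension r
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj"],  # only Q, K, V matrices
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)  # W_0 frozen; only A and B are trained
model.print_trainable_parameters()
      </preformat>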
      <p>We train with a learning rate of 2 × 10−4 with 5 warm-up steps, weight decay of 0.01, and AdamW [25] as the optimization algorithm. Prior to fine-tuning, Llama-3.3-70B-Instruct is quantized at 4-bit precision with QLoRA [26], using bitsandbytes.9</p>
      <p>9 https://github.com/bitsandbytes-foundation/bitsandbytes</p>
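      <p>A hedged sketch of the corresponding 4-bit loading step for the 70B model with bitsandbytes follows; the exact flags used in our experiments are not reported here and are assumed for illustration.</p>
      <preformat>
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_cfg = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",               # QLoRA's NormalFloat4 data type
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.3-70B-Instruct", quantization_config=bnb_cfg
)
# LoRA adapters are then attached on top of the frozen 4-bit weights.
      </preformat>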
      <p>Due to the sparsity of low-occurrence tags, we focus
on evaluating the model on the most common ones using
micro-averaged precision, recall, and F1-measure. The
prediction of a tag is considered correct only if both the
tag and the associated text match. For example, in the
sentence &lt;DMCC corr="Not"&gt;not&lt;/DMCC&gt; really,
what is your proposal &lt;QM corr="?"&gt;\0&lt;/QM&gt;
the prediction would be incorrect if the tag DMCC was
assigned to “not really” rather than just “not”. As regards
this example, also note that the model is required to
generate “\0” tokens, representing omitted words.</p>
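      <p>The strict matching criterion can be expressed as follows (a sketch over (tag, span) pairs; the actual evaluation script is part of the released repository):</p>
      <preformat>
from collections import Counter

def micro_prf(gold, pred):
    """gold/pred: lists of (tag, span_text) pairs pooled over the test set."""
    tp = sum((Counter(gold) &amp; Counter(pred)).values())  # exact tag+span matches
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

gold = [("DMCC", "not"), ("QM", "\\0")]
pred = [("DMCC", "not really"), ("QM", "\\0")]  # DMCC span too wide: not counted
print(micro_prf(gold, pred))                    # (0.5, 0.5, 0.5)
      </preformat>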
      <sec id="sec-3-1">
        <title>Each model is fine-tuned and evaluated on five difer</title>
        <p>ent seeds, for which we report the average performance
along with the standard deviation. During evaluation,
we allow the model to generate up to 1,000 new tokens,
which we deem suficient based on instance lengths. We
select the best epoch based on the highest micro-averaged</p>
      </sec>
      <sec id="sec-3-2">
        <title>F1-measure on the development set. We report micro</title>
        <p>averaged metrics, since macro-averaging does not
provide a faithful picture of model performance, due to the
long tail of low-occurrence classes (Table 2).</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>5. Experiments</title>
      <sec id="sec-4-1">
        <title>We task the fine-tuned models to automatically annotate glish. We experiment with two levels of granularity of error classification, one at the level of the macro category i.e. those listed in Table 2.</title>
        <p>We also use two diferent types of prompts. The first
includes ICL ∈ {0, 2, 4, 6, 8, 10} pairs of unannotated
and annotated student messages. We vary the number
because an insuficient amount might not provide the
model with enough information to produce optimal
performance, while an excessive quantity might excessively
shift attention from the target task. The second type of</p>
      </sec>
      <sec id="sec-4-2">
        <title>9https://github.com/bitsandbytes-foundation/bitsandbytes</title>
        <p>Llama-3.3-70B-Instruct
✓
×
×
×
✓
✓
0
2
4
6
8
0
2
4
6
8
6
10
10
0
2
4
6
8
0
2
4
6
8
6
10
10
linguistic errors in sentences written by learners of En- prompt includes the  = 10 chat messages preceding the
(e.g., “Form”, or “Punctuation”) and one at the tag level, parameter search as regards the number of in-context
Llama-3.3-70B-Instruct</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>6. Results</title>
      <p>Random sampling ICL. The results marginalised across all classes for the fine-grained setting are listed in Table 4; the best performance is obtained with 6 positive and 6 negative pairs of examples. This shows our concerns with finding the best number of examples were founded, since higher amounts lead to increasingly worse performance. However, most of the performance gain is obtained by going from ICL = 0 to even just providing 2 pairs of examples, even without the model being shown the meaning of the tags. Indeed, overall the best results for Llama-3.1-8B-Instruct are achieved when not including the tags and their descriptions in the prompt. Gajo and Barrón-Cedeño [27] reported similar results, where increasing the number of examples yielded diminishing returns when extracting RDF triples from texts and overly long lists of references in the prompt diluted model attention away from the target task.</p>
      <p>Fine-tuning Llama-3.3-70B-Instruct with the best hyperparameter ICL = 6 and no tags in the prompt, the model obtains a micro-F1 of 0.472. Out of five seeds, the highest validation performance is obtained twice on the first epoch, twice on the second, and only once on the third. Since the model is only shown 831 training examples and the first and second epochs already provide the best performance, the model seems to fit very quickly to the patterns it needs to recognize to identify errors.</p>
      <p>The overall results for the coarse-grained categories are reported in Table 5. The performance is overall slightly higher when including the categories in the prompt. In this case, since only 9 classes are listed, the model is able to make good use of the provided information. Indeed, not only are the mean scores higher, but the standard deviation is also lower at ICL = 6, which is the setting that yields the highest performance with Llama-3.1-8B-Instruct. As for Llama-3.3-70B-Instruct, performance is greater, but with a smaller gap between the two models, compared to the fine-grained tags.</p>
      <p>The full results for each fine-grained tag at all values of ICL are reported in Table 9 in Appendix B. At the fine-grained level, only a few high-frequency tags such as DMCC (927 instances) and FS (314) are predicted reliably. Most of the others are either predicted with very high standard deviations or do not receive predictions at all, due to the sparsity of labels. Nonetheless, the performance for several morphosyntactic tags, e.g. GNN (80), GPP (72) and GVAUX (51), exhibits gradual improvements with increasing values of ICL, indicating that training the model on a higher number of examples might be beneficial for some classes.</p>
      <p>Based on the distribution shown in Table 2, the amount of training instances per class indeed seems to strongly correlate with performance. However, Z (54), used to indicate stylistic problems, is never predicted correctly by either of the models, despite having a number of instances comparable to that of much better-performing classes, e.g. QM (60) or WM (51), respectively used for missing punctuation and words. Since the latter clearly affect the format and structure of the sentence via omission, this hints at the fact that the model more easily handles structural errors, compared to those where style and semantics are involved. Table 10 in Appendix B reports the results for each coarse-grained category for all values of ICL.</p>
      <p>Table 6: Overall micro-averaged results for Llama-3.1-8B-Instruct and Llama-3.3-70B-Instruct for the context prompt setting, using fine-grained (F) and coarse (C) categories, with (✓) and without (×) the tag listing in the prompt.</p>
      <preformat>
Tags        F1              Precision       Recall
Llama-3.1-8B-Instruct
F   ✓       0.207 ± 0.091   0.237 ± 0.100   0.194 ± 0.093
F   ×       0.221 ± 0.079   0.256 ± 0.071   0.198 ± 0.083
C   ✓       0.186 ± 0.056   0.214 ± 0.100   0.191 ± 0.075
C   ×       0.234 ± 0.090   0.275 ± 0.097   0.208 ± 0.088
Llama-3.3-70B-Instruct
F   ×       0.395 ± 0.109   0.360 ± 0.088   0.375 ± 0.095
C   ×       0.455 ± 0.084   0.417 ± 0.076   0.434 ± 0.077
      </preformat>
      <p>Context ICL. As shown in Table 6, the performance using context prompts is much lower than when using randomly sampled example pairs. An analysis of Llama-3.1-8B-Instruct's predictions shows that, at times, the model makes mistakes even on easy instances of the DMC category, i.e. the one with overall highest results. For example, in “student: It's perfect! Thank &lt;XVCO corr="you"&gt;u&lt;/XVCO&gt; so much”, the model assigns XVCO (errors with verb complementation) rather than DMCA to a clear-cut case of Internet-style abbreviation. Considering the performance on this class is above 0.800 when using random ICL example pairs, this is a clear hint that the context does not provide useful information for the best-performing categories. Indeed, the macro-categories for which contextual information is likely to be most relevant are lexis (L) and infelicities (Z), where discourse-level or pragmatic cues are critical in assessing appropriateness and distinguishing genuine errors from stylistic deviations. However, as shown in Table 7, the performance for these categories is very low (L) or null (Z). For Llama-3.1-8B-Instruct, the performance on the L category (F1 = 0.070) is worse than the one obtained in the random ICL sampling setting, even with ICL = 0 (F1 = 0.091, see Table 10). Therefore, even in the cases in which the model would supposedly benefit from being provided the context of the conversation, simply having it memorize decontextualized examples through causal language modeling provides better performance. Indeed, as already mentioned in the previous section, the model likely pays more attention to the shallow structure of the sentence rather than to complex semantic relationships. Thus, having it learn annotations directly from XML-formatted examples provides superior performance. This is also clear based on the fact that Llama-3.1-8B-Instruct can outperform its bigger counterpart just by changing the prompting strategy, although the performance obtained by Llama-3.3-70B-Instruct when using context prompts is closer to the one obtained with random sampling ICL. The context ICL results for all fine-grained tags can be found in Table 11 in Appendix B.</p>
      <p>Table 7: Micro-averaged F1 results per category for Llama-3.1-8B-Instruct and Llama-3.3-70B-Instruct with the best-performing ICL = 6 using coarse-grained categories, with (✓) and without (×) the tag listing. C=CS, D=DMC.</p>
      <preformat>
            Rng (ICL = 6)                   Context (n = 10)
Tags        8B              70B             8B              70B
C   ×       0.050 ± 0.112                   0.000 ± 0.000
    ✓       0.197 ± 0.192   0.175 ± 0.186   0.000 ± 0.000   0.053 ± 0.119
D   ×       0.813 ± 0.059                   0.512 ± 0.149
    ✓       0.827 ± 0.051   0.854 ± 0.036   0.552 ± 0.130   0.759 ± 0.088
F   ×       0.534 ± 0.047                   0.269 ± 0.088
    ✓       0.497 ± 0.123   0.551 ± 0.090   0.155 ± 0.060   0.433 ± 0.103
G   ×       0.247 ± 0.039                   0.094 ± 0.045
    ✓       0.306 ± 0.025   0.333 ± 0.041   0.075 ± 0.037   0.242 ± 0.061
Z   ×       0.000 ± 0.000                   0.000 ± 0.000
    ✓       0.000 ± 0.000   0.000 ± 0.000   0.000 ± 0.000   0.000 ± 0.000
X   ×       0.068 ± 0.064                   0.000 ± 0.000
    ✓       0.064 ± 0.095   0.117 ± 0.149   0.000 ± 0.000   0.038 ± 0.054
L   ×       0.157 ± 0.054                   0.065 ± 0.042
    ✓       0.168 ± 0.076   0.201 ± 0.051   0.070 ± 0.050   0.103 ± 0.048
Q   ×       0.194 ± 0.129                   0.000 ± 0.000
    ✓       0.222 ± 0.102   0.262 ± 0.102   0.000 ± 0.000   0.184 ± 0.200
W   ×       0.081 ± 0.063                   0.000 ± 0.000
    ✓       0.066 ± 0.067   0.117 ± 0.129   0.000 ± 0.000   0.050 ± 0.090
      </preformat>
    </sec>
    <sec id="sec-concl">
      <title>7. Conclusions</title>
      <p>In this study, we have built a corpus of human-computer interactions, assessing the feasibility of fine-tuning LLMs to automatically carry out error annotation. Through a series of experiments across two annotation granularities (coarse and fine-grained), we evaluated the capabilities and limitations of both Llama-3.1-8B-Instruct and Llama-3.3-70B-Instruct to learn through causal LM from two prompting paradigms. The first included the conversation context of the message requiring annotation, while the other entailed a varying number of randomly sampled ICL examples. Both prompt types optionally included explicit information about the target error classes.</p>
      <p>Perhaps unsurprisingly, coarse-grained annotation obtains better scores than fine-grained tagging across all configurations, suggesting the viability of a hybrid, semi-automatic pipeline where LLMs handle broader error categories before finer distinctions are resolved through human post-editing or specialised tools. Model performance improved via ICL examples, peaking around 6 pairs of positive and negative instances, before exhibiting diminishing returns. This trend held across both granularities and prompt types, although not always linearly. In particular, random example-based prompts yielded substantially higher and more stable results compared to context-only ones, for both the fine- and coarse-grained annotation tasks, suggesting that focused demonstration of error-tag mappings better supports autoregressive modeling than situational grounding. The lower effectiveness of context-only prompts may also reflect a mismatch between the data and the annotation scheme, where error identification, at least of the issues observed in these conversations, is mostly self-contained within each learner's turn. Including additional text to be processed likely dilutes the model's attention, which is spread across a higher number of tokens, ultimately lowering learning effectiveness.</p>
      <p>At a tag-specific level, results highlight the challenges of sparse class supervision for this task, with only a handful of high-frequency labels being predicted reliably. Nonetheless, we provide evidence of LLMs being able to internalise recurring learner patterns through causal LM, given they are shown enough instances. Variation across the explored hyperparameters was modest. This implies that the performance ceilings are primarily determined by task complexity and data sparsity, rather than the suboptimal nature of specific training approaches.</p>
      <p>In future work, we plan to produce synthetic training data for the task approached in this work, in order to improve model performance. In addition, we wish to extend the annotation to additional resources and leverage them for the development of better automatic error annotation systems. Finally, we aim to evaluate model performance also in terms of the proposed corrections.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <sec id="sec-6-1">
        <title>We express our sincere gratitude to Arianna Paradisi</title>
        <p>(University of Bologna) for her valuable support and
insightful contributions to the development of the error
tagging manual. Her expertise and collaboration were
instrumental in shaping the guidelines used in this work.</p>
      </sec>
      <sec id="sec-6-2">
        <title>We also thank the research team of the UNITE - UNiver</title>
        <p>sally Inclusive Technologies to practice English10 project
for providing the resources that made this study possible.
10UNITE – UNiversally Inclusive Technologies to practice English
(funded by the European Union – NextGenerationEU under Italy’s
National Recovery and Resilience Plan (PNRR); Project code
2022JB5KAL, CUP J53D23008070006
moyer, Qlora: Eficient finetuning of quantized
llms, arXiv preprint arXiv:2305.14314 (2023).
[27] Gajo, Barrón-Cedeño, Natural vs Programming</p>
      </sec>
      <sec id="sec-6-3">
        <title>Language in LLM Knowledge Graph Construc</title>
        <p>tion, Information Processing &amp; Management 62
(2025) 104195. URL: https://www.sciencedirect.com/
science/article/pii/S0306457325001360. doi:https:
//doi.org/10.1016/j.ipm.2025.104195.</p>
        <p>Categories (in italics), descriptions, and references for the error tags used in corpus annotation.</p>
        <p>Description
Digitally-Mediated Communication</p>
        <p>&lt;DMCC&gt; Capitalization issues.</p>
        <p>Errors in coordinating conjunctions. &lt;LCS&gt; Errors in subordinating conjunctions.
Errors with single logical connectors. &lt;LCLC&gt; Errors with complex logical connectors.
Conceptual/collocational errors with adjectives. &lt;LSADV&gt; Conceptual/collocational errors with adverbs.
Conceptual/collocational errors with nouns. &lt;LSPR&gt; Conceptual/collocational errors with prepositions.
Conceptual/collocational errors with verbs. &lt;LWCO&gt; Coined words or calques.</p>
        <p>Errors in fixed word combinations, including idioms, compounds, and phrasal verbs.</p>
        <p>Missing words.</p>
        <p>Word order errors.</p>
        <p>&lt;WR&gt;</p>
        <p>Redundant words.</p>
        <p>Code-switching within a sentence.</p>
        <p>&lt;CSINTER&gt;</p>
        <p>Code-switching between sentences or turns.</p>
        <p>Stylistic problems or unclear sequences requiring reformulation.</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>A. Full list of tags</title>
    </sec>
    <sec id="sec-8">
      <title>C. Computational resources</title>
      <sec id="sec-8-1">
        <title>In this section, we report on the tagset used for the learner</title>
      </sec>
      <sec id="sec-8-2">
        <title>For each prompt type, training Llama-3.1-8B-Instruct took</title>
        <p>error annotation task, a revised version of the UCLou- ∼ 20 minutes on a single NVIDIA H100 (96GB of VRAM),
vain Error Editor Version 2. Table 8 lists all of the error
macro- and micro-categories, their specific tags, and a
brief description of each tag.</p>
      </sec>
    </sec>
    <sec id="sec-9">
      <title>B. Full results</title>
      <sec id="sec-9-1">
        <title>Here, we report the full results for Llama-3.1-8B-Instruct</title>
        <p>and Llama-3.3-70B-Instruct. The results for the random</p>
      </sec>
      <sec id="sec-9-2">
        <title>ICL sampling setting are reported in Table 9 for the fine</title>
        <p>grained tags and in Table 10 for the coarse-grained
categories. The results for the fine-grained categories in the
context prompt setting are reported in Table 11.
for a total of about 17 hours over all the 50 combinations
of seeds and hyperparameters. Training
Llama-3.3-70B</p>
      </sec>
      <sec id="sec-9-3">
        <title>Instruct for each of its five runs per setting took around</title>
        <p>90 minutes, for an additional 15 hours for the two prompt
types.
FS
GA
GADVO
GDI
GNC
GNN
GPI
GPP
GPR
GVAUX
GVM
GVN
GVNF
GVT
GWC
LP
LSADJ
LSADV
LSN
LSPR
LSV
LWCO
QC
QM
WM
WO
WR
XADJPR
LSN
LSPR
QM
XVCO
Declaration on Generative AI</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [13]
          <article-title>Centre for English Corpus Linguistics</article-title>
          , Learner cor-
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <article-title>pora around the world</article-title>
          ,
          <year>2024</year>
          . [1]
          <string-name>
            <given-names>G.</given-names>
            <surname>Berruto</surname>
          </string-name>
          ,
          <article-title>Le regole in linguistica</article-title>
          , in: N. Grandi [14]
          <string-name>
            <given-names>S.</given-names>
            <surname>Bibauw</surname>
          </string-name>
          , W. Van den Noortgate, T. François,
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          Press, Bologna,
          <year>2015</year>
          , pp.
          <fpage>43</fpage>
          -
          <lpage>61</lpage>
          .
          <article-title>A meta-analysis, Language Learning</article-title>
          &amp; Technol[2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Lüdeling</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Hirschmann</surname>
          </string-name>
          , Error annotation sys- ogy
          <volume>26</volume>
          (
          <year>2022</year>
          )
          <fpage>1</fpage>
          -
          <lpage>24</lpage>
          . URL: https://www.lltjournal.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          tems, in: S. Granger,
          <string-name>
            <given-names>G.</given-names>
            <surname>Gilquin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Meunier</surname>
          </string-name>
          (Eds.), org/item/10125-
          <fpage>73488</fpage>
          /.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          The Cambridge Handbook of Learner Corpus Re- [15]
          <string-name>
            <given-names>S.</given-names>
            <surname>Granger</surname>
          </string-name>
          ,
          <article-title>Learner corpora and error annotation:</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <surname>search</surname>
          </string-name>
          , Cambridge University Press,
          <year>2015</year>
          , pp.
          <fpage>135</fpage>
          -
          <article-title>Where are we and where are we going?</article-title>
          , Interna-
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          157. doi:
          <volume>10</volume>
          .1017/CBO9781139649414.007.
          <source>tional Journal of Learner Corpus Research</source>
          <volume>10</volume>
          (
          <year>2024</year>
          ) [3]
          <string-name>
            <given-names>G.</given-names>
            <surname>Gilquin</surname>
          </string-name>
          ,
          <article-title>Learner corpora</article-title>
          , in: M.
          <string-name>
            <surname>Paquot</surname>
          </string-name>
          , S. T.
          <volume>25</volume>
          -
          <fpage>45</fpage>
          . doi:
          <volume>10</volume>
          .1075/ijlcr.00008.gra.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <surname>Gries</surname>
            (Eds.),
            <given-names>A Practical</given-names>
          </string-name>
          <string-name>
            <surname>Handbook of Corpus</surname>
            Lin- [16]
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Granger</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          <string-name>
            <surname>Swallow</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Thewissen</surname>
          </string-name>
          , The lou-
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          guistics, Springer, Cham,
          <year>2020</year>
          , pp.
          <fpage>283</fpage>
          -
          <lpage>303</lpage>
          . vain error tagging
          <source>manual version 2</source>
          .0,
          <year>2022</year>
          . [4]
          <string-name>
            <given-names>L.</given-names>
            <surname>Anthony</surname>
          </string-name>
          ,
          <article-title>Corpus ai: Integrating large language URL: https://oer</article-title>
          .uclouvain.be/jspui/bitstream/
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <article-title>models (llms) into a corpus analysis toolkit</article-title>
          ,
          <year>2023</year>
          .
          <volume>20</volume>
          .500.12279/968/4/Granger%20et%
          <fpage>20al</fpage>
          ._Error%
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          URL: https://osf.io/srtyd/.
          <source>20tagging%20manual%202</source>
          .
          <article-title>0_final_CC.pdf</article-title>
          . [5]
          <string-name>
            <given-names>N.</given-names>
            <surname>Curry</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Baker</surname>
          </string-name>
          , G. Brookes, Generative ai for [17]
          <string-name>
            <given-names>C.</given-names>
            <surname>Cervini</surname>
          </string-name>
          , E. Paone, Comunicare all'universitÀ:
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <source>evaluation of chatgpt, Applied Corpus Linguistics LinguaDue</source>
          <volume>16</volume>
          (
          <year>2024</year>
          )
          <fpage>496</fpage>
          -
          <lpage>523</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <volume>4</volume>
          (
          <year>2024</year>
          )
          <fpage>100082</fpage>
          . [18]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Mathet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Widlöcher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Métivier</surname>
          </string-name>
          , The unified [6]
          <string-name>
            <given-names>D.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Su</surname>
          </string-name>
          ,
          <string-name>
            <surname>M.</surname>
          </string-name>
          <article-title>Fuoli, Assessing the po- and holistic method gamma ( ) for inter-annotator</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <source>pragmatics and discourse analysis: The case of apol- Linguistics</source>
          <volume>41</volume>
          (
          <year>2015</year>
          )
          <fpage>437</fpage>
          -
          <lpage>479</lpage>
          . URL: https://doi.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          <string-name>
            <surname>ogy</surname>
          </string-name>
          ,
          <source>International Journal of Corpus Linguistics 29 org/10</source>
          .1162/COLI_a_00230. doi:
          <volume>10</volume>
          .1162/COLI_
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          (
          <year>2024</year>
          )
          <fpage>534</fpage>
          -
          <lpage>561</lpage>
          . doi:
          <volume>10</volume>
          .1075/ijcl.23087.yu. a_
          <volume>00230</volume>
          . [7]
          <string-name>
            <given-names>M.</given-names>
            <surname>Imamovic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Deilen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Glynn</surname>
          </string-name>
          , E. Lapshinova- [19]
          <string-name>
            <given-names>A.</given-names>
            <surname>Vaswani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Shazeer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Parmar</surname>
          </string-name>
          , J. Uszkoreit,
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          in: S. Henning, M. Stede (Eds.),
          <source>Proceedings of Information Processing Systems</source>
          , volume
          <volume>30</volume>
          , Cur-
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          <source>the 18th Linguistic Annotation Workshop</source>
          (LAW- ran
          <string-name>
            <surname>Associates</surname>
          </string-name>
          , Inc.,
          <year>2017</year>
          . URL: https://proceedings.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          <article-title>XVIII), Association for Computational Linguistics, neurips</article-title>
          .cc/paper_files/paper/2017/hash/
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          <string-name>
            <surname>St. Julians</surname>
          </string-name>
          , Malta,
          <year>2024</year>
          , pp.
          <fpage>112</fpage>
          -
          <lpage>123</lpage>
          . URL:
          <article-title>https: 3f5ee243547dee91fbd053c1c4a845aa-Abstract.</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          //aclanthology.org/
          <year>2024</year>
          .law-
          <volume>1</volume>
          .11/. html. [8]
          <string-name>
            <given-names>L.</given-names>
            <surname>Kohnke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. L.</given-names>
            <surname>Moorhouse</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Zou</surname>
          </string-name>
          , Chatgpt for [20]
          <string-name>
            <given-names>A.</given-names>
            <surname>Dubey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Jauhri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Pandey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kadian</surname>
          </string-name>
          , The
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          <article-title>language teaching and learning</article-title>
          ,
          <source>RELC Journal 54 Llama 3 Herd of Models</source>
          ,
          <year>2024</year>
          . URL: http://arxiv.
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          (
          <year>2023</year>
          ). doi:
          <volume>10</volume>
          .1177/00336882231204379. org/abs/2407.21783. [9]
          <string-name>
            <given-names>G.</given-names>
            <surname>Gilquin</surname>
          </string-name>
          , From design to collection of learner [21]
          <string-name>
            <given-names>J.</given-names>
            <surname>Ainslie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lee-Thorp</surname>
          </string-name>
          , M. d. Jong, Y. Zemlyanskiy,
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          <string-name>
            <surname>Research</surname>
          </string-name>
          , Cambridge University Press,
          <year>2015</year>
          , pp.
          <fpage>9</fpage>
          -
          <lpage>Checkpoints</lpage>
          ,
          <year>2023</year>
          . URL: http://arxiv.org/abs/2305.
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          34. doi:
          <volume>10</volume>
          .1017/CBO9781139649414.002. 13245. doi:
          <volume>10</volume>
          .48550/arXiv.2305.13245. [10]
          <string-name>
            <given-names>N.</given-names>
            <surname>Nesselhauf</surname>
          </string-name>
          ,
          <article-title>Learner corpora</article-title>
          and their potential [22]
          <string-name>
            <given-names>N.</given-names>
            <surname>Shazeer</surname>
          </string-name>
          , GLU Variants Improve Transformer,
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          <article-title>for language teaching</article-title>
          , in: J. M.
          <string-name>
            <surname>Sinclair</surname>
          </string-name>
          (Ed.),
          <year>How 2020</year>
          . URL: http://arxiv.org/abs/
          <year>2002</year>
          .05202. doi:10.
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          <article-title>to Use Corpora in Language Teaching</article-title>
          , John Ben-
          <volume>48550</volume>
          /arXiv.
          <year>2002</year>
          .
          <volume>05202</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          <string-name>
            <surname>jamins</surname>
          </string-name>
          ,
          <year>2004</year>
          , pp.
          <fpage>125</fpage>
          -
          <lpage>152</lpage>
          . doi:
          <volume>10</volume>
          .1075/scl.12. [23]
          <string-name>
            <given-names>E. J.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Wallis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Allen-Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          <string-name>
            <given-names>11nes. S.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Wang</surname>
          </string-name>
          , W. Chen, LoRA: Low-Rank [11]
          <string-name>
            <given-names>F.</given-names>
            <surname>Meunier</surname>
          </string-name>
          , Introduction to learner corpus research,
          <source>Adaptation of Large Language Models</source>
          ,
          <year>2021</year>
          . URL:
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          <string-name>
            <given-names>in: N.</given-names>
            <surname>Tracy-Ventura</surname>
          </string-name>
          ,
          <string-name>
            <surname>M.</surname>
          </string-name>
          Paquot (Eds.), The Rout- http://arxiv.org/abs/2106.09685.
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          <source>ledge Handbook of Second Language Acquisition</source>
          [24]
          <string-name>
            <given-names>D. P.</given-names>
            <surname>Kingma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ba</surname>
          </string-name>
          ,
          <article-title>Adam: A Method for Stochastic</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          <source>and Corpora</source>
          , Routledge, New York,
          <year>2020</year>
          , pp.
          <fpage>23</fpage>
          -
          <lpage>36</lpage>
          . Optimization,
          <year>2017</year>
          . URL: http://arxiv.org/abs/1412. [12]
          <string-name>
            <given-names>C.</given-names>
            <surname>Davis</surname>
          </string-name>
          , et al.,
          <string-name>
            <surname>Prompting</surname>
          </string-name>
          open-source and com-
          <volume>6980</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          <article-title>mercial language models for grammatical error</article-title>
          [25]
          <string-name>
            <given-names>I.</given-names>
            <surname>Loshchilov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Hutter</surname>
          </string-name>
          ,
          <string-name>
            <surname>Decoupled Weight</surname>
          </string-name>
          De-
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          <article-title>correction of english learner text</article-title>
          ,
          <source>arXiv</source>
          (
          <year>2024</year>
          ).
          <source>cay Regularization</source>
          ,
          <year>2019</year>
          . URL: http://arxiv.org/abs/
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>URL: https://doi.org/10.48550/ARXIV.2401.07702. 1711.05101.</mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          <source>arXiv:2401</source>
          .
          <fpage>07702</fpage>
          . [26]
          <string-name>
            <given-names>T.</given-names>
            <surname>Dettmers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Pagnoni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Holtzman</surname>
          </string-name>
          , L. Zettle-
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>