<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>LEARN: on the feasibility of Learner Error AutoRegressive Neural annotation</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Paolo Gajo</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Daniele Polizzi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Adriano Ferraresi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alberto Barrón-Cedeño</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Università di Bologna</institution>
          ,
          <addr-line>Corso della Repubblica, 136, 47121, Forlì</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>Error annotation is a defining feature of learner corpora, essential for understanding second-language development. Its centrality is mirrored by the meticulous effort required for its implementation, which is typically conducted in manual fashion. In this exploratory study, we investigate the feasibility of automating the task by training large language models (LLMs) in the context of dialogue-based Computer-Assisted Language Learning (CALL). We experiment with instruction-tuned LLMs across annotation granularities and prompting strategies. Results show that coarse-grained tags are more reliably predicted than fine-grained ones, with few-shot example-based prompting outperforming context-only formats. These findings point to the potential of LLMs for semi-automatic error annotation, while underscoring the need for larger datasets and the effectiveness of training models through causal LM to handle rare linguistic phenomena. Code and data: https://github.com/paolo-gajo/LEARN</p>
      </abstract>
      <kwd-group>
        <kwd>large language models</kwd>
        <kwd>low-rank adaptation</kwd>
        <kwd>error annotation</kwd>
        <kwd>learner corpora</kwd>
        <kwd>human-computer interaction</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Error annotation plays a crucial role in learner corpus research, a domain of inquiry that, while closely related to second language acquisition (SLA), is distinguished by its focus on providing insights into learners' interlanguage systems and acquisition patterns. The underlying assumption is that errors, defined as the application of an internalised rule not prescribed by established linguistic norms [1], are not merely indicators of textual quality, but a reflection of learners' evolving competence in their target language [2].</p>
      <p>Regardless of the taxonomy's level of granularity, error annotation remains a time-consuming task, susceptible to inconsistencies in human judgment and inaccuracies from automatic parsers originally designed for native input [3]. As generative AI architectures begin to populate linguistic toolkits [4] and mimic established approaches to language analysis [5], an opportunity arises to reduce the burden of manual annotation while retaining the depth of linguistic insight traditionally required for this complex task. While a limited number of studies do investigate the use of the technology to annotate pragmatic and discourse-level features, including [6] on apologetic expressions and [7] on evaluative stance, its applications in the context of learner corpus research remain scarce.</p>
      <p>To address this issue, we investigate the feasibility of training large language models (LLMs) to automate error annotation, establishing a baseline for comparison while focusing on an increasingly relevant mode of text production: human-computer interactions [8]. The task proves particularly challenging due to the complexity of the tagset adopted, the model's limited domain-specific expertise, and the scarcity of annotated training data available. Our contributions are two-fold: (i) we release a novel dataset containing 2,675 manual annotations of linguistic errors across fifty texts; (ii) using LoRA-tuned LLMs, we assess the impact of four combinations of prompting strategies on automatic error annotation in human-computer written interactions, establishing a benchmark for future work in the area.</p>
      <p>The rest of the paper is structured as follows: Section 2 outlines the role of learner corpora in SLA research, with a focus on error annotation practices. Section 3 introduces the dataset and the tagset used in the experiments, along with a description of the annotation process. Section 4 provides specifics on the model architecture, training, and evaluation. Section 5 lays out the settings approached for the automatic annotation task. Section 6 reports the results of the experiments. Finally, Section 7 draws conclusions and offers suggestions on future research avenues. In Appendix A, we provide a full list of the used categories and tags. Appendix B reports the full results. Appendix C provides information on the used computational resources.</p>
      <p>CLiC-it 2025: Eleventh Italian Conference on Computational Linguistics, September 24–26, 2025, Cagliari, Italy. * Corresponding author. paolo.gajo2@unibo.it (P. Gajo); daniele.polizzi2@unibo.it (D. Polizzi); adriano.ferraresi@unibo.it (A. Ferraresi); a.barron@unibo.it (A. Barrón-Cedeño). https://www.unibo.it/sitoweb/paolo.gajo2 (P. Gajo); https://www.unibo.it/sitoweb/daniele.polizzi2 (D. Polizzi); https://www.unibo.it/sitoweb/adriano.ferraresi/cv-en (A. Ferraresi); https://www.unibo.it/sitoweb/a.barron (A. Barrón-Cedeño). ORCID: 0009-0009-9372-3323 (P. Gajo); 0009-0007-1927-4158 (D. Polizzi); 0000-0002-6957-0605 (A. Ferraresi); 0000-0003-4719-3420 (A. Barrón-Cedeño). © 2025 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).</p>
    </sec>
    <sec id="sec-2">
      <title>2. Background and Motivation</title>
      <sec id="sec-2-1">
        <title>This challenge is not just one of scale, but also of</title>
        <p>
          scope. Learner corpora are still predominantly focused
Learner corpora are systematic collections of electronic on argumentative or academic writing, mirroring the
texts whose key defining feature lies in the representation types of structured tasks performed in traditional
eduof “language as produced by foreign or second language cational settings. Interactive language use, by contrast,
(L2) learners” [9]. They are increasingly used in various remains significantly underrepresented and tied to
semistrands of empirical SLA research, varying across multi- structured interview formats [
          <xref ref-type="bibr" rid="ref1">13</xref>
          ], which only partially
ple dimensions: medium (spoken or written), genre (such capture the dynamic and co-constructed nature of
realas essays, summaries and interviews), learners’ linguistic time communication. This gap is particularly
problembackground, sampling strategies (synchronic, longitudi- atic given the centrality of interactionist approaches to
nal or quasi-longitudinal), intended pedagogical or re- SLA, which emphasise the role of input, opportunity for
search purpose, and geographical scope of data collection output, feedback, and negotiation of meaning in driving
(ranging from local to large-scale initiatives) [9]. Each of acquisition [14]. As Granger [15] forecasts, the future
these design parameters shapes the corpus analytical po- of learner corpus research lies not only in enhancing
tential and determines its suitability for diferent lines of annotation practices but also in expanding corpora to
linguistic inquiry, particularly those aimed at identifying new educational contexts, each potentially introducing
developmental trajectories and persistent learner dificul- distinct patterns of learner language that call for targeted
ties [10]. Their structured format also makes them a valu- annotation strategies.
able resource for the development of natural language Shifts towards greater variability in learner data
amprocessing (NLP) applications grounded in authentic data plify the need for scalable, adaptive annotation methods.
that are used for educational purposes [
          <xref ref-type="bibr" rid="ref29">11</xref>
          ]. Our contribution presents an exploratory case study
in
        </p>
        <p>Central to all of these applications is the identification vestigating whether small-scale, open-weight LLMs can
and classification of errors, which serve not only as indi- reliably be trained to automate learner error annotation,
cators of language proficiency but also as windows into evaluating not only their diagnostic capabilities but also
the evolving interlanguage systems of learners. These their alignment with linguistic taxonomies and
estaberrors are signalled using a predefined taxonomy that lished error annotation conventions. More specifically,
serves the purpose of assigning tags, i.e. labels captur- we test this feasibility in an unconventional setting for
ing specific categories and subcategories of errors, to the learner corpora annotation: informal dialogue practice.
corresponding portion of text. To ensure consistency,
annotation typically follows detailed guidelines, which
provide operational definitions and prototypical cases for 3. Data
each tag. However, the process still requires annotators
to formulate a hypothesis about the nature of each error, The dataset employed contains human–machine written
interpreting the distance between the learner’s produc- interaction data, contributing to an increasingly relevant
tion and the expected target form as either structural or research strand focusing on conversational AI’s
efectivelinguistic per se [2]. ness for language development [14]. It features
English</p>
        <p>In spite of the subjectivity inherently built into the as-foreign-language (EFL) productions of Italian
univertask, expert judgment has so far ofered the most reliable sity students aged 18–25 from diverse degree programs,
means of ensuring both consistency and linguistic accu- most of whom self-report a low-to upper-intermediate
racy, striking a delicate balance between introspection proficiency level. One distinct interaction for each
stuand methodological rigour that underpins high-quality dent (50 in total) was collected based on a protocol
comlearner corpus annotation. While projects like the Cam- bining one of two diferent LLM-based chatbots with two
bridge Learner Corpus (CLC)1 and the International Cor- EFL learning scenarios. The chatbots used during the
pus of Learner English (ICLE)2 have demonstrated the experimental sessions are ChatGPT,3 a general-purpose
value of error-tagged data for SLA research, annotation Generative AI tool, and Pi.ai,4 a task-oriented chatbot
remains labour-intensive and demands substantial ex- specifically developed to engage in natural language
conpertise and time investment. The existence of automatic versation. The learning scenarios are structured around
approaches to learner corpus error annotation, by con- two communicative formats that constitute part of
stantrast, remains largely limited. Although some research dardised English proficiency tests: open-ended
conversahas investigated advanced technologies such as LLMs for tion (small talk) and target-oriented dialogue (role play).
grammatical error identification [ 12], to the best of our While small talk allows participants to freely express
knowledge no published work has explored their capacity themselves on past experiences, current interests and
to perform full-fledged annotation of learner language. events or future projects, role playing requires them to</p>
      </sec>
      <sec id="sec-2-2">
        <title>1https://www.cambridge.org/elt/corpus/learner_corpus2.htm</title>
      </sec>
      <sec id="sec-2-3">
        <title>2https://www.uclouvain.be/en/research-institutes/ilc/cecl/icle</title>
      </sec>
      <sec id="sec-2-4">
        <title>3https://chatgpt.com/</title>
      </sec>
      <sec id="sec-2-5">
        <title>4https://pi.ai/talk</title>
        <p>Source Token Count and calques have been assigned a distinct subcategory
Learner-Produced (total) 17,730 (LWCO) falling within that of lexis (L) rather than form</p>
        <sec id="sec-2-5-1">
          <title>Small talk 10,548 (F). The rationale behind this change follows on Cervini</title>
          <p>ChaRtobloetp-Glaeynerated (total) 957,,312802 and Paone’s [17] classification of intercomprehension</p>
        </sec>
        <sec id="sec-2-5-2">
          <title>Small talk 39,033 strategies, where both calques and neologisms are con</title>
        </sec>
      </sec>
      <sec id="sec-2-6">
        <title>Role play 56,287 ceived as pertaining to the lexical dimension of commu</title>
        <p>Total 113,901 nication. The remaining macro-categories are retained
as originally defined [ 16]. Grammatical Errors (G) are
Table 1 violations of standard grammar rules that afect
syntacDataset token distribution by task type. tic structure, including subject–verb agreement, misuse
of tenses, article errors, or problems with word forms,
such as pronouns and determiners. Lexico-Grammatical
use context-sensitive vocabulary and formulaic language. Errors (X) involve combination patterns specific to the
As such, both tasks prove particularly efective in cover- word rather than sentence-wide grammar, including
deing a wide variety of use cases where multiple examples pendent prepositions or verb complementations. Lexical
of errors might appear, ranging from grammar and lexis Errors (L) concern vocabulary choices that do not match
to register and style. The dataset annotation scheme the intended meaning or context, hence coming across
features structural information on turns and contextual as semantically awkward or stylistically inappropriate.
information on the chatbot used, the tasks performed and Word Errors (W) target imbalances in a sentence caused
the learner profile. Token counts are reported in Table 1. by omitting necessary words, adding superfluous ones,
or placing words in an unnatural or incorrect order.
Punc3.1. Tagset tuation Errors (Q) cover incorrect, missing, or excessive
use of marks, such as commas, periods, or colons. Finally,
Our benchmark for automatic error identification con- Infelicities (Z) address stylistic concerns that, while not
sists of fifty texts manually annotated by two expert strictly errors, may require reformulation for the sake
anglicists, using an adapted version of the Louvain Error of clarity or naturalness (Z). See Table 8 in Appendix A
Tagging Manual Version 2.0 [16]. While the taxonomy for a complete list of the tags used, together with a brief
does not align with any specific formal SLA theory or description of their coverage for each use case.
L1–L2 pairing, it was selected precisely for its broad Errors were marked using inline XML-style tags
recognition within the learner corpus research commu- of the format &lt;TAG corr="correction"&gt;incorrect
nity, a de facto standard providing a comprehensive map- text&lt;/TAG&gt; via the Université Catholique de Louvain
ping of errors discussed in the field. The adaptation was Error Tagging Editor (UCLEE).5 In case of the addition
carried out through preliminary pilot tests and includes of missing words or the omission of redundant ones,
several fine-tuning operations that introduce revised use the format is &lt;TAG corr="correction"&gt;\0&lt;/TAG&gt;
cases and five new tags. The updated manual comprises or &lt;TAG corr="\0"&gt;incorrect text&lt;/TAG&gt;, respectively.
59 categories, spanning across eight domains: digitally- The software supports the insertion, editing and
processmediated communication (DMC), form (F), punctuation ing of error tags using a preferred tagset. To
accommo(Q), grammar (G), lexico-grammar (X), lexis (L), word date the specific requirements of our task, we uploaded a
(W), infelicities (Z) and code-switching (CS). custom .tag file reflecting the necessary modifications we</p>
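        <p>To make the annotation format concrete, the following minimal sketch (illustrative only, not the authors' released code) extracts (tag, erroneous span, correction) triples from an annotated sentence; nested tags, which the guidelines allow, are not handled here.</p>
        <preformat>
import re

# Matches &lt;TAG corr="correction"&gt;incorrect text&lt;/TAG&gt;; "\0" marks empty spans.
TAG_RE = re.compile(
    r'&lt;(?P&lt;tag&gt;[A-Z]+) corr="(?P&lt;corr&gt;[^"]*)"&gt;(?P&lt;span&gt;.*?)&lt;/(?P=tag)&gt;'
)

def parse_annotations(annotated: str) -&gt; list[tuple[str, str, str]]:
    """Return (tag, erroneous_span, correction) triples from one sentence."""
    return [(m["tag"], m["span"], m["corr"]) for m in TAG_RE.finditer(annotated)]

example = ('The food is not very good in &lt;DMCC corr="Spain"&gt;spain&lt;/DMCC&gt; '
           '&lt;LCC corr="\\0"&gt;and&lt;/LCC&gt; but the &lt;FS corr="atmosphere"&gt;atmophere&lt;/FS&gt;')
print(parse_annotations(example))
# [('DMCC', 'spain', 'Spain'), ('LCC', 'and', '\\0'), ('FS', 'atmophere', 'atmosphere')]
        </preformat>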
        <p>In line with the Louvain Manual, corrections were minimal and hypothesis-driven, ensuring that tags reflect plausible learner intentions and do not result in speculative rewriting of the original text. Tags were assigned based on the erroneous form itself, using the shortest possible span required to isolate it. Regional spelling variants (e.g., British and American English) were not flagged, as participants received no instruction on preferred norms. Likewise, punctuation errors were annotated only when they hindered readability, in recognition of informal communication habits. Cases where multiple errors overlapped were nested within one another, with spelling errors being considered the lowest level, i.e. the first correction to be applied.</p>
        <p>Figure 1: XML annotation output of the UCLEE software.</p>
        <preformat>
&lt;?xml version="1.0" encoding="utf-8"?&gt;
&lt;file name="id_1.txt" tagset="uclee-en-2.0.tag"&gt;
&lt;text id="id_1" area_of_study="Social sciences" age="24" [...]&gt;
&lt;task type="small talk"&gt;
&lt;turn type="chatbot" who="Pi.ai"&gt;Hey there, great to meet you.
I'm Pi, your personal AI. [...]&lt;/turn&gt;
&lt;turn type="student"&gt;Hi&lt;/turn&gt;
&lt;turn type="chatbot" who="Pi.ai"&gt;Hey User!
How's everything going on your side? [...]&lt;/turn&gt;
&lt;turn type="student"&gt;&lt;DMCC corr="How"&gt;how&lt;/DMCC&gt; are you today?&lt;/turn&gt;
[...]
&lt;/task&gt;
&lt;task type="role play"&gt;
&lt;turn type="student"&gt;&lt;DMCC corr="You"&gt;you&lt;/DMCC&gt; are an encouraging tutor
who helps students improve their &lt;DMCC corr="English"&gt;english&lt;/DMCC&gt; by
engaging in role play &lt;FS corr="activities"&gt;actvities&lt;/FS&gt;. [...]&lt;/turn&gt;
&lt;turn type="chatbot" who="Pi.ai"&gt;Great idea! Let's start the role play.
As the Restorative Justice, I'm interested in [...]&lt;/turn&gt;
[...]
&lt;/task&gt;
&lt;/text&gt;
&lt;/file&gt;
        </preformat>
        <p>Inter-annotator agreement (IAA) was calculated on five separate texts using the Gamma coefficient [18], a metric suited to evaluating categorical labels with overlapping text spans. Annotation files were first parsed to extract error tags and their corresponding character offsets using a custom XML processing function. The agreement was recorded only when annotators applied the same error tag to mark the exact same character span as erroneous. Scores registered a mean of 0.77024 ± 0.09270. The computation was repeated a second time on all tags except those targeting formal spelling (FS) and digitally-mediated communication (DMC), that is, taking into account only the most subjective among the sub-categories in our tagset; FS and DMC alone account for 53.60% of all the tagged issues. The results show an agreement of 0.74698 ± 0.13027. Given the strictness of our criteria, we consider the obtained IAA to be highly satisfactory and reliable, since γ &lt; 0 signifies worse-than-random agreement and the upper bound is γ = 1.</p>
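        <p>A computation of this kind can be reproduced, under assumptions about file parsing, with the pygamma-agreement package, which implements Mathet et al.'s γ [18]; the annotator names, offsets, and tags below are illustrative only, not our actual data.</p>
        <preformat>
from pyannote.core import Segment
from pygamma_agreement import Continuum

continuum = Continuum()
# One unit per error tag: annotator id, character-offset span, tag label.
for annotator, start, end, tag in [
    ("ann1", 0, 3, "DMCC"), ("ann2", 0, 3, "DMCC"),   # exact agreement
    ("ann1", 10, 19, "FS"), ("ann2", 10, 21, "FS"),   # overlapping spans
]:
    continuum.add(annotator, Segment(start, end), tag)

# gamma &lt; 0 signals worse-than-random agreement; the upper bound is 1.
gamma = continuum.compute_gamma().gamma
print(f"gamma = {gamma:.3f}")
        </preformat>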
      <sec id="sec-2-7">
        <title>The data are compiled by filtering out the chatbot re</title>
        <p>sponses and splitting the collection into training,
development, and testing partitions with an 80/10/10 split.</p>
      </sec>
      <sec id="sec-2-8">
        <title>Five diferent (fixed) seeds are used to split the data and</title>
        <p>initialise model states, which helps us mitigate variance
in the results. Table 2 provides information on the
distribution of the tags, which has a long tail formed by rare
tags, 22 of which have fewer than 10 occurrences.</p>
      </sec>
      <sec id="sec-2-9">
        <title>As exemplified in Figure 2, we experiment with two</title>
        <p>types of in-context learning (ICL) sections (bottom row),
Figure 1: XML annotation output of the UCLEE software. each using fine- or coarse-grained tags (top row), for a
total of four prompt combinations. The prompt starts with
Table 2 a system message defining the LLM persona, followed
Distribution of the tags in the data used for training, develop- by the instruction. The macro categories or tags are then
ment, and testing. optionally listed. In the first experimental setting, a
varying number of ICL examples is included. For all data
Tag # splits, pairs of examples are sampled at random solely
DMCC 927 LP 45 GDO 13 XNCO 4 from the training set, across any of the student-chatbot
FGSA 134194 LLSSVN 4435 XQNRUC 1122 XGAPDDJCO 34 conversations. We sample an equal number of examples
LSPR 80 CSINTRA 33 CSINTER 11 LCC 3 with and without error annotations.6 Finally, the task is</p>
      </sec>
      <sec id="sec-2-10">
        <title>GNN 80 GVN 32 GPI 10 LCLC 3 repeated to mark the target sentence.</title>
        <p>GPP 72 XVPR 28 GADVO 9 GADJO 2 In the second setting, we provide the model with the
GWVOT 6643 GQNCC 2274 GGDDTI 88 XGPPRUCO 22 context of the conversation to which the target message
QM 60 GVNF 23 GADJCS 7 GPO 2 belongs. Note that in this case, what we divide in 80/20/20
Z 54 DMCA 23 XNPR 7 XADVPR 1 splits is the list of conversations, rather than the
individLXWVCCOO 5512 GGPVRM 2108 GQDLD 66 LGCPLFS 11 ual messages. Since conversations do not all have the
WM 51 LSADV 16 FM 5 same size, in this case each seed produces diferent split</p>
      </sec>
      <sec id="sec-2-11">
        <title>GVAUX 51 GWC 15 XADJPR 5 sizes, as shown in Table 3.</title>
        <p>WR 49 LSADJ 15 LCS 4 In our experiments, we wish to showcase the impact
of using random annotated instances vs unannotated
context. Therefore, although the data partitions used in
spelling errors being considered the lowest level, i.e. the the two settings are produced in diferent ways, we still
ifrst correction to be applied. deem our approach to be valid, considering the use of</p>
        <sec id="sec-2-11-1">
          <title>Inter-annotator agreement (IAA) was calculated on five diferent seeds.</title>
          <p>
            ifve separate texts using the Gamma coeficient [
            <xref ref-type="bibr" rid="ref18">18</xref>
            ], a
metric suited to evaluating categorical labels with
overlapping text spans. Annotation files were first parsed
to extract error tags and their corresponding
character ofsets using a custom XML processing function.
          </p>
        </sec>
      </sec>
      <sec id="sec-2-12">
        <title>The agreement was recorded only when annotators ap</title>
      </sec>
      <sec id="sec-2-13">
        <title>6The original and the annotated utterances are separated by ###</title>
        <p>symbols to avoid any subwords being merged with the separator
by the used tokenisers.</p>
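        <p>The following sketch shows how a random-sampling ICL prompt of this kind can be assembled (the function and field names are assumptions for illustration, not the repository's actual interface):</p>
        <preformat>
import random

SEP = "###"  # keeps original and annotated utterances apart for the tokeniser

def build_prompt(train_examples, target, tag_listing=None, icl=6):
    """icl examples with error annotations and icl examples without."""
    pos = [e for e in train_examples if e["has_errors"]]
    neg = [e for e in train_examples if not e["has_errors"]]
    demos = random.sample(pos, icl) + random.sample(neg, icl)
    random.shuffle(demos)
    lines = ["You are an AI specialized in the task of annotating grammatical errors.",
             "Annotate the target sentence below with the following tags, in XML style."]
    if tag_listing:  # optional coarse- or fine-grained tag descriptions
        lines += tag_listing
    lines.append("Below are reference examples:")
    lines += [f"{d['original']}{SEP}{d['annotated']}" for d in demos]
    lines.append("Annotate the following target sentence, without providing any explanation:")
    lines.append(f"{target}{SEP}")
    return "\n".join(lines)
        </preformat>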
        <p>In the second setting, we provide the model with the context of the conversation to which the target message belongs. Note that in this case, what we divide in 80/10/10 splits is the list of conversations, rather than the individual messages. Since conversations do not all have the same size, in this case each seed produces different split sizes, as shown in Table 3.</p>
        <p>In our experiments, we wish to showcase the impact of using random annotated instances vs unannotated context. Therefore, although the data partitions used in the two settings are produced in different ways, we still deem our approach to be valid, considering the use of five different seeds.</p>
        <p>Figure 2: Structure of the prompts: system instruction, optional tag listing, and either randomly sampled ICL example pairs or the preceding chat messages, followed by the target sentence.</p>
        <preformat>
You are an AI specialized in the task of annotating grammatical errors.
Annotate the target sentence below with the following tags, in XML style.
Reproduce the full sentence and annotate each error.
The following are the tags you should use for annotation:
&lt;DMCC&gt;: Capitalization issues. [...]
&lt;WO&gt;: Errors in word order.

Code-Switching: use of L1 (native language). [...]
Infelicities: stylistic concerns (not strictly errors).

Below are reference examples:
Everything is going fine. How are you?###Everything is going fine. How are you? [...]
The food is not very good in spain and but the atmophere Is fantastic###The food
is not very good in &lt;DMCC corr="Spain"&gt;spain&lt;/DMCC&gt; &lt;LCC corr="\0"&gt;and&lt;/LCC&gt; but
the &lt;FS corr="atmosphere"&gt;atmophere&lt;/FS&gt; &lt;DMCC corr="is"&gt;Is&lt;/DMCC&gt; fantastic

Below are the chat messages preceding the target sentence:
Pi.ai: Hey there, great to meet you. I'm Pi, [...]
student: Hi pi can we do a roleplay to help me practice my english?
Pi.ai: Absolutely, User! Role-playing can be a great way [...]
student: I would like to do a customer service scenario
Pi.ai: Sure thing! Let's start the [...]

Annotate the following target sentence, without providing any explanation:
Yes please, I would like a bottle of water and a glass of wine###
        </preformat>
        <p>Table 3: Split sizes for the training, development, and testing partitions, for the random ICL sampling and context prompt settings.</p>
        <preformat>
Setting                 Train   Dev   Test
Random ICL sampling       831   104    104
        </preformat>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Model</title>
      <p>We fine-tune both models through causal language modeling over the full prompt,8 since we want the model to learn to predict the annotated sentences not just from the target sentence, but also from the tags and the examples included in the prompt. In other words, we simultaneously train the model on a large amount of sampled examples within the prompt, through teacher forcing, and we also instruction-tune it to predict the desired target sentence.</p>
      <p>8 The model needs to be given the prompt in a chat template (https://huggingface.co/docs/transformers/en/chat_templating#applychattemplate), which we omit here for clarity.</p>
      <p>The architecture of these models consists in a token/positional embedding layer, followed by a stack of decoders, with a language modeling classifier on top. Each decoder comprises a grouped-query attention layer [21], followed by a set of MLP layers each using a SwiGLU activation function [22]. We update the weights of the decoder blocks with LoRA [23], only targeting the key, query, and value matrices K, Q, V of the attention layers:</p>
      <p>Attention(Q, K, V) = Softmax((QK⊤ + M) / √d_k) V,</p>
      <p>where M is the matrix filled with zero values in the lower triangular part and −∞ elsewhere, and d_k is the output dimension of Q and K. The attention and MLP layer parameters are kept frozen during training. The original input to these layers is simultaneously processed through LoRA components consisting of weight matrices A ∈ R^(d_1 × r) and B ∈ R^(r × d_2), where r ≪ d_1, d_2 represents the low-rank projection dimension, while d_1 and d_2 correspond to the input and output dimensions of each respective layer. During training, only the LoRA matrices A and B receive parameter updates. Thus, the forward pass of an input x through a layer with frozen weight W_0 is modified as:</p>
      <p>h = x W_0 + x A B.</p>
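      <p>In practice, this configuration corresponds to attaching LoRA adapters to the attention projections, e.g. via Hugging Face peft; the sketch below is a minimal illustration, with placeholder rank and scaling values rather than the paper's hyperparameters.</p>
      <preformat>
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
lora_cfg = LoraConfig(
    r=16,                                           # low-rank dimension r
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj"],  # only Q, K, V matrices
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)  # W_0 frozen; only A and B are trained
model.print_trainable_parameters()
      </preformat>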
      <p>We train with a learning rate of 2 × 10−4 with 5 warm-up steps, weight decay of 0.01, and AdamW [25] as the optimization algorithm. Prior to fine-tuning, Llama-3.3-70B-Instruct is quantized at 4-bit precision with QLoRA [26], using bitsandbytes.9</p>
      <p>9 https://github.com/bitsandbytes-foundation/bitsandbytes</p>
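      <p>A hedged sketch of the corresponding 4-bit loading step for the 70B model with bitsandbytes follows; the exact flags used in our experiments are not reported here and are assumed for illustration.</p>
      <preformat>
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_cfg = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",               # QLoRA's NormalFloat4 data type
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.3-70B-Instruct", quantization_config=bnb_cfg
)
# LoRA adapters are then attached on top of the frozen 4-bit weights.
      </preformat>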
      <p>Due to the sparsity of low-occurrence tags, we focus
on evaluating the model on the most common ones using
micro-averaged precision, recall, and F1-measure. The
prediction of a tag is considered correct only if both the
tag and the associated text match. For example, in the
sentence &lt;DMCC corr="Not"&gt;not&lt;/DMCC&gt; really,
what is your proposal &lt;QM corr="?"&gt;\0&lt;/QM&gt;
the prediction would be incorrect if the tag DMCC was
assigned to “not really” rather than just “not”. As regards
this example, also note that the model is required to
generate “\0” tokens, representing omitted words.</p>
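      <p>The strict matching criterion can be expressed as follows (a sketch over (tag, span) pairs; the actual evaluation script is part of the released repository):</p>
      <preformat>
from collections import Counter

def micro_prf(gold, pred):
    """gold/pred: lists of (tag, span_text) pairs pooled over the test set."""
    tp = sum((Counter(gold) &amp; Counter(pred)).values())  # exact tag+span matches
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

gold = [("DMCC", "not"), ("QM", "\\0")]
pred = [("DMCC", "not really"), ("QM", "\\0")]  # DMCC span too wide: not counted
print(micro_prf(gold, pred))                    # (0.5, 0.5, 0.5)
      </preformat>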
      <sec id="sec-3-1">
        <title>Each model is fine-tuned and evaluated on five difer</title>
        <p>ent seeds, for which we report the average performance
along with the standard deviation. During evaluation,
we allow the model to generate up to 1,000 new tokens,
which we deem suficient based on instance lengths. We
select the best epoch based on the highest micro-averaged</p>
      </sec>
      <sec id="sec-3-2">
        <title>F1-measure on the development set. We report micro</title>
        <p>averaged metrics, since macro-averaging does not
provide a faithful picture of model performance, due to the
long tail of low-occurrence classes (Table 2).</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>5. Experiments</title>
      <sec id="sec-4-1">
        <title>We task the fine-tuned models to automatically annotate glish. We experiment with two levels of granularity of error classification, one at the level of the macro category i.e. those listed in Table 2.</title>
        <p>We also use two diferent types of prompts. The first
includes ICL ∈ {0, 2, 4, 6, 8, 10} pairs of unannotated
and annotated student messages. We vary the number
because an insuficient amount might not provide the
model with enough information to produce optimal
performance, while an excessive quantity might excessively
shift attention from the target task. The second type of</p>
      </sec>
      <sec id="sec-4-2">
        <title>9https://github.com/bitsandbytes-foundation/bitsandbytes</title>
        <p>Llama-3.3-70B-Instruct
✓
×
×
×
✓
✓
0
2
4
6
8
0
2
4
6
8
6
10
10
0
2
4
6
8
0
2
4
6
8
6
10
10
linguistic errors in sentences written by learners of En- prompt includes the  = 10 chat messages preceding the
(e.g., “Form”, or “Punctuation”) and one at the tag level, parameter search as regards the number of in-context
Llama-3.3-70B-Instruct</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>6. Results</title>
      <p>Random sampling ICL. The results marginalised across all classes for the fine-grained setting are listed in Table 4; the best performance is obtained with 6 positive and 6 negative pairs of examples. This shows our concerns with finding the best number of examples were founded, since higher amounts lead to increasingly worse performance. However, most of the performance gain is obtained by going from ICL = 0 to even just providing 2 pairs of examples, even without the model being shown the meaning of the tags. Indeed, overall the best results for Llama-3.1-8B-Instruct are achieved when not including the tags and their descriptions in the prompt. Gajo and Barrón-Cedeño [27] reported similar results, where increasing the number of examples yielded diminishing returns when extracting RDF triples from texts and overly long lists of references in the prompt diluted model attention away from the target task.</p>
      <p>Fine-tuning Llama-3.3-70B-Instruct with the best hyperparameter ICL = 6 and no tags in the prompt, the model obtains a micro-F1 of 0.472. Out of five seeds, the highest validation performance is obtained twice on the first epoch, twice on the second, and only once on the third. Since the model is only shown 831 training examples and the first and second epochs already provide the best performance, the model seems to fit very quickly to the patterns it needs to recognize to identify errors.</p>
      <p>The overall results for the coarse-grained categories are reported in Table 5. The performance is overall slightly higher when including the categories in the prompt. In this case, since only 9 classes are listed, the model is able to make good use of the provided information. Indeed, not only are the mean scores higher, but the standard deviation is also lower at ICL = 6, which is the setting that yields the highest performance with Llama-3.1-8B-Instruct. As for Llama-3.3-70B-Instruct, performance is greater, but with a smaller gap between the two models, compared to the fine-grained tags.</p>
      <p>The full results for each fine-grained tag at all values of ICL are reported in Table 9 in Appendix B. At the fine-grained level, only a few high-frequency tags such as DMCC (927 instances) and FS (314) are predicted reliably. Most of the others are either predicted with very high standard deviations or do not receive predictions at all, due to the sparsity of labels. Nonetheless, the performance for several morphosyntactic tags, e.g. GNN (80), GPP (72) and GVAUX (51), exhibits gradual improvements with increasing values of ICL, indicating that training the model on a higher number of examples might be beneficial for some classes.</p>
      <p>Based on the distribution shown in Table 2, the amount of training instances per class indeed seems to strongly correlate with performance. However, Z (54), used to indicate stylistic problems, is never predicted correctly by either of the models, despite having a number of instances comparable to that of much better-performing classes, e.g. QM (60) or WM (51), respectively used for missing punctuation and words. Since the latter clearly affect the format and structure of the sentence via omission, this hints at the fact that the model more easily handles structural errors, compared to those where style and semantics are involved. Table 10 in Appendix B reports the results for each coarse-grained category for all values of ICL.</p>
      <p>Table 6: Overall micro-averaged results for Llama-3.1-8B-Instruct and Llama-3.3-70B-Instruct for the context prompt setting, using fine-grained (F) and coarse (C) categories, with (✓) and without (×) the tag listing in the prompt.</p>
      <preformat>
Tags        F1              Precision       Recall
Llama-3.1-8B-Instruct
F   ✓       0.207 ± 0.091   0.237 ± 0.100   0.194 ± 0.093
F   ×       0.221 ± 0.079   0.256 ± 0.071   0.198 ± 0.083
C   ✓       0.186 ± 0.056   0.214 ± 0.100   0.191 ± 0.075
C   ×       0.234 ± 0.090   0.275 ± 0.097   0.208 ± 0.088
Llama-3.3-70B-Instruct
F   ×       0.395 ± 0.109   0.360 ± 0.088   0.375 ± 0.095
C   ×       0.455 ± 0.084   0.417 ± 0.076   0.434 ± 0.077
      </preformat>
      <p>Context ICL. As shown in Table 6, the performance using context prompts is much lower than when using randomly sampled example pairs. An analysis of Llama-3.1-8B-Instruct's predictions shows that, at times, the model makes mistakes even on easy instances of the DMC category, i.e. the one with overall highest results. For example, in “student: It's perfect! Thank &lt;XVCO corr="you"&gt;u&lt;/XVCO&gt; so much”, the model assigns XVCO (errors with verb complementation) rather than DMCA to a clear-cut case of Internet-style abbreviation. Considering the performance on this class is above 0.800 when using random ICL example pairs, this is a clear hint that the context does not provide useful information for the best-performing categories. Indeed, the macro-categories for which contextual information is likely to be most relevant are lexis (L) and infelicities (Z), where discourse-level or pragmatic cues are critical in assessing appropriateness and distinguishing genuine errors from stylistic deviations. However, as shown in Table 7, the performance for these categories is very low (L) or null (Z). For Llama-3.1-8B-Instruct, the performance on the L category (F1 = 0.070) is worse than the one obtained in the random ICL sampling setting, even with ICL = 0 (F1 = 0.091, see Table 10). Therefore, even in the cases in which the model would supposedly benefit from being provided the context of the conversation, simply having it memorize decontextualized examples through causal language modeling provides better performance. Indeed, as already mentioned in the previous section, the model likely pays more attention to the shallow structure of the sentence rather than to complex semantic relationships. Thus, having it learn annotations directly from XML-formatted examples provides superior performance. This is also clear based on the fact that Llama-3.1-8B-Instruct can outperform its bigger counterpart just by changing the prompting strategy, although the performance obtained by Llama-3.3-70B-Instruct when using context prompts is closer to the one obtained with random sampling ICL. The context ICL results for all fine-grained tags can be found in Table 11 in Appendix B.</p>
      <p>Table 7: Micro-averaged F1 results per category for Llama-3.1-8B-Instruct and Llama-3.3-70B-Instruct with the best-performing ICL = 6 using coarse-grained categories, with (✓) and without (×) the tag listing. C=CS, D=DMC.</p>
      <preformat>
            Rng (ICL = 6)                   Context (n = 10)
Tags        8B              70B             8B              70B
C   ×       0.050 ± 0.112                   0.000 ± 0.000
    ✓       0.197 ± 0.192   0.175 ± 0.186   0.000 ± 0.000   0.053 ± 0.119
D   ×       0.813 ± 0.059                   0.512 ± 0.149
    ✓       0.827 ± 0.051   0.854 ± 0.036   0.552 ± 0.130   0.759 ± 0.088
F   ×       0.534 ± 0.047                   0.269 ± 0.088
    ✓       0.497 ± 0.123   0.551 ± 0.090   0.155 ± 0.060   0.433 ± 0.103
G   ×       0.247 ± 0.039                   0.094 ± 0.045
    ✓       0.306 ± 0.025   0.333 ± 0.041   0.075 ± 0.037   0.242 ± 0.061
Z   ×       0.000 ± 0.000                   0.000 ± 0.000
    ✓       0.000 ± 0.000   0.000 ± 0.000   0.000 ± 0.000   0.000 ± 0.000
X   ×       0.068 ± 0.064                   0.000 ± 0.000
    ✓       0.064 ± 0.095   0.117 ± 0.149   0.000 ± 0.000   0.038 ± 0.054
L   ×       0.157 ± 0.054                   0.065 ± 0.042
    ✓       0.168 ± 0.076   0.201 ± 0.051   0.070 ± 0.050   0.103 ± 0.048
Q   ×       0.194 ± 0.129                   0.000 ± 0.000
    ✓       0.222 ± 0.102   0.262 ± 0.102   0.000 ± 0.000   0.184 ± 0.200
W   ×       0.081 ± 0.063                   0.000 ± 0.000
    ✓       0.066 ± 0.067   0.117 ± 0.129   0.000 ± 0.000   0.050 ± 0.090
      </preformat>
    </sec>
    <sec id="sec-concl">
      <title>7. Conclusions</title>
      <p>In this study, we have built a corpus of human-computer interactions, assessing the feasibility of fine-tuning LLMs to automatically carry out error annotation. Through a series of experiments across two annotation granularities (coarse and fine-grained), we evaluated the capabilities and limitations of both Llama-3.1-8B-Instruct and Llama-3.3-70B-Instruct to learn through causal LM from two prompting paradigms. The first included the conversation context of the message requiring annotation, while the other entailed a varying number of randomly sampled ICL examples. Both prompt types optionally included explicit information about the target error classes.</p>
      <p>Perhaps unsurprisingly, coarse-grained annotation obtains better scores than fine-grained tagging across all configurations, suggesting the viability of a hybrid, semi-automatic pipeline where LLMs handle broader error categories before finer distinctions are resolved through human post-editing or specialised tools. Model performance improved via ICL examples, peaking around 6 pairs of positive and negative instances, before exhibiting diminishing returns. This trend held across both granularities and prompt types, although not always linearly. In particular, random example-based prompts yielded substantially higher and more stable results compared to context-only ones, for both the fine- and coarse-grained annotation tasks, suggesting that focused demonstration of error-tag mappings better supports autoregressive modeling than situational grounding. The lower effectiveness of context-only prompts may also reflect a mismatch between the data and the annotation scheme, where error identification, at least of the issues observed in these conversations, is mostly self-contained within each learner's turn. Including additional text to be processed likely dilutes the model's attention, which is spread across a higher number of tokens, ultimately lowering learning effectiveness.</p>
      <p>At a tag-specific level, results highlight the challenges of sparse class supervision for this task, with only a handful of high-frequency labels being predicted reliably. Nonetheless, we provide evidence of LLMs being able to internalise recurring learner patterns through causal LM, given they are shown enough instances. Variation across the explored hyperparameters was modest. This implies that the performance ceilings are primarily determined by task complexity and data sparsity, rather than the suboptimal nature of specific training approaches.</p>
      <p>In future work, we plan to produce synthetic training data for the task approached in this work, in order to improve model performance. In addition, we wish to extend the annotation to additional resources and leverage them for the development of better automatic error annotation systems. Finally, we aim to evaluate model performance also in terms of the proposed corrections.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <sec id="sec-6-1">
        <title>We express our sincere gratitude to Arianna Paradisi</title>
        <p>(University of Bologna) for her valuable support and
insightful contributions to the development of the error
tagging manual. Her expertise and collaboration were
instrumental in shaping the guidelines used in this work.</p>
      </sec>
      <sec id="sec-6-2">
        <title>We also thank the research team of the UNITE - UNiver</title>
        <p>sally Inclusive Technologies to practice English10 project
for providing the resources that made this study possible.
10UNITE – UNiversally Inclusive Technologies to practice English
(funded by the European Union – NextGenerationEU under Italy’s
National Recovery and Resilience Plan (PNRR); Project code
2022JB5KAL, CUP J53D23008070006
moyer, Qlora: Eficient finetuning of quantized
llms, arXiv preprint arXiv:2305.14314 (2023).
[27] Gajo, Barrón-Cedeño, Natural vs Programming</p>
      </sec>
      <sec id="sec-6-3">
        <title>Language in LLM Knowledge Graph Construc</title>
        <p>tion, Information Processing &amp; Management 62
(2025) 104195. URL: https://www.sciencedirect.com/
science/article/pii/S0306457325001360. doi:https:
//doi.org/10.1016/j.ipm.2025.104195.</p>
        <p>Categories (in italics), descriptions, and references for the error tags used in corpus annotation.</p>
        <p>Description
Digitally-Mediated Communication</p>
        <p>&lt;DMCC&gt; Capitalization issues.</p>
        <p>Errors in coordinating conjunctions. &lt;LCS&gt; Errors in subordinating conjunctions.
Errors with single logical connectors. &lt;LCLC&gt; Errors with complex logical connectors.
Conceptual/collocational errors with adjectives. &lt;LSADV&gt; Conceptual/collocational errors with adverbs.
Conceptual/collocational errors with nouns. &lt;LSPR&gt; Conceptual/collocational errors with prepositions.
Conceptual/collocational errors with verbs. &lt;LWCO&gt; Coined words or calques.</p>
        <p>Errors in fixed word combinations, including idioms, compounds, and phrasal verbs.</p>
        <p>Missing words.</p>
        <p>Word order errors.</p>
        <p>&lt;WR&gt;</p>
        <p>Redundant words.</p>
        <p>Code-switching within a sentence.</p>
        <p>&lt;CSINTER&gt;</p>
        <p>Code-switching between sentences or turns.</p>
        <p>Stylistic problems or unclear sequences requiring reformulation.</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>A. Full list of tags</title>
    </sec>
    <sec id="sec-8">
      <title>C. Computational resources</title>
      <sec id="sec-8-1">
        <title>In this section, we report on the tagset used for the learner</title>
      </sec>
      <sec id="sec-8-2">
        <title>For each prompt type, training Llama-3.1-8B-Instruct took</title>
        <p>error annotation task, a revised version of the UCLou- ∼ 20 minutes on a single NVIDIA H100 (96GB of VRAM),
vain Error Editor Version 2. Table 8 lists all of the error
macro- and micro-categories, their specific tags, and a
brief description of each tag.</p>
      </sec>
    </sec>
    <sec id="sec-9">
      <title>B. Full results</title>
      <sec id="sec-9-1">
        <title>Here, we report the full results for Llama-3.1-8B-Instruct</title>
        <p>and Llama-3.3-70B-Instruct. The results for the random</p>
      </sec>
      <sec id="sec-9-2">
        <title>ICL sampling setting are reported in Table 9 for the fine</title>
        <p>grained tags and in Table 10 for the coarse-grained
categories. The results for the fine-grained categories in the
context prompt setting are reported in Table 11.
for a total of about 17 hours over all the 50 combinations
of seeds and hyperparameters. Training
Llama-3.3-70B</p>
      </sec>
      <sec id="sec-9-3">
        <title>Instruct for each of its five runs per setting took around</title>
        <p>90 minutes, for an additional 15 hours for the two prompt
types.
FS
GA
GADVO
GDI
GNC
GNN
GPI
GPP
GPR
GVAUX
GVM
GVN
GVNF
GVT
GWC
LP
LSADJ
LSADV
LSN
LSPR
LSV
LWCO
QC
QM
WM
WO
WR
XADJPR
LSN
LSPR
QM
XVCO
Declaration on Generative AI</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [13]
          <article-title>Centre for English Corpus Linguistics</article-title>
          , Learner cor-
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <article-title>pora around the world</article-title>
          ,
          <year>2024</year>
          . [1]
          <string-name>
            <given-names>G.</given-names>
            <surname>Berruto</surname>
          </string-name>
          ,
          <article-title>Le regole in linguistica</article-title>
          , in: N. Grandi [14]
          <string-name>
            <given-names>S.</given-names>
            <surname>Bibauw</surname>
          </string-name>
          , W. Van den Noortgate, T. François,
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          Press, Bologna,
          <year>2015</year>
          , pp.
          <fpage>43</fpage>
          -
          <lpage>61</lpage>
          .
          <article-title>A meta-analysis, Language Learning</article-title>
          &amp; Technol[2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Lüdeling</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Hirschmann</surname>
          </string-name>
          , Error annotation sys- ogy
          <volume>26</volume>
          (
          <year>2022</year>
          )
          <fpage>1</fpage>
          -
          <lpage>24</lpage>
          . URL: https://www.lltjournal.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          tems, in: S. Granger,
          <string-name>
            <given-names>G.</given-names>
            <surname>Gilquin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Meunier</surname>
          </string-name>
          (Eds.), org/item/10125-
          <fpage>73488</fpage>
          /.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          The Cambridge Handbook of Learner Corpus Re- [15]
          <string-name>
            <given-names>S.</given-names>
            <surname>Granger</surname>
          </string-name>
          ,
          <article-title>Learner corpora and error annotation:</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <surname>search</surname>
          </string-name>
          , Cambridge University Press,
          <year>2015</year>
          , pp.
          <fpage>135</fpage>
          -
          <article-title>Where are we and where are we going?</article-title>
          , Interna-
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          157. doi:
          <volume>10</volume>
          .1017/CBO9781139649414.007.
          <source>tional Journal of Learner Corpus Research</source>
          <volume>10</volume>
          (
          <year>2024</year>
          ) [3]
          <string-name>
            <given-names>G.</given-names>
            <surname>Gilquin</surname>
          </string-name>
          ,
          <article-title>Learner corpora</article-title>
          , in: M.
          <string-name>
            <surname>Paquot</surname>
          </string-name>
          , S. T.
          <volume>25</volume>
          -
          <fpage>45</fpage>
          . doi:
          <volume>10</volume>
          .1075/ijlcr.00008.gra.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <surname>Gries</surname>
            (Eds.),
            <given-names>A Practical</given-names>
          </string-name>
          <string-name>
            <surname>Handbook of Corpus</surname>
            Lin- [16]
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Granger</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          <string-name>
            <surname>Swallow</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Thewissen</surname>
          </string-name>
          , The lou-
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          guistics, Springer, Cham,
          <year>2020</year>
          , pp.
          <fpage>283</fpage>
          -
          <lpage>303</lpage>
          . vain error tagging
          <source>manual version 2</source>
          .0,
          <year>2022</year>
          . [4]
          <string-name>
            <given-names>L.</given-names>
            <surname>Anthony</surname>
          </string-name>
          ,
          <article-title>Corpus ai: Integrating large language URL: https://oer</article-title>
          .uclouvain.be/jspui/bitstream/
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <article-title>models (llms) into a corpus analysis toolkit</article-title>
          ,
          <year>2023</year>
          .
          <volume>20</volume>
          .500.12279/968/4/Granger%20et%
          <fpage>20al</fpage>
          ._Error%
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          URL: https://osf.io/srtyd/.
          <source>20tagging%20manual%202</source>
          .
          <article-title>0_final_CC.pdf</article-title>
          . [5]
          <string-name>
            <given-names>N.</given-names>
            <surname>Curry</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Baker</surname>
          </string-name>
          , G. Brookes, Generative ai for [17]
          <string-name>
            <given-names>C.</given-names>
            <surname>Cervini</surname>
          </string-name>
          , E. Paone, Comunicare all'universitÀ:
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <source>evaluation of chatgpt, Applied Corpus Linguistics LinguaDue</source>
          <volume>16</volume>
          (
          <year>2024</year>
          )
          <fpage>496</fpage>
          -
          <lpage>523</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <volume>4</volume>
          (
          <year>2024</year>
          )
          <fpage>100082</fpage>
          . [18]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Mathet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Widlöcher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Métivier</surname>
          </string-name>
          , The unified [6]
          <string-name>
            <given-names>D.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Su</surname>
          </string-name>
          ,
          <string-name>
            <surname>M.</surname>
          </string-name>
          <article-title>Fuoli, Assessing the po- and holistic method gamma ( ) for inter-annotator</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <source>pragmatics and discourse analysis: The case of apol- Linguistics</source>
          <volume>41</volume>
          (
          <year>2015</year>
          )
          <fpage>437</fpage>
          -
          <lpage>479</lpage>
          . URL: https://doi.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          <string-name>
            <surname>ogy</surname>
          </string-name>
          ,
          <source>International Journal of Corpus Linguistics 29 org/10</source>
          .1162/COLI_a_00230. doi:
          <volume>10</volume>
          .1162/COLI_
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          (
          <year>2024</year>
          )
          <fpage>534</fpage>
          -
          <lpage>561</lpage>
          . doi:
          <volume>10</volume>
          .1075/ijcl.23087.yu. a_
          <volume>00230</volume>
          . [7]
          <string-name>
            <given-names>M.</given-names>
            <surname>Imamovic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Deilen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Glynn</surname>
          </string-name>
          , E. Lapshinova- [19]
          <string-name>
            <given-names>A.</given-names>
            <surname>Vaswani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Shazeer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Parmar</surname>
          </string-name>
          , J. Uszkoreit,
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          in: S. Henning, M. Stede (Eds.),
          <source>Proceedings of Information Processing Systems</source>
          , volume
          <volume>30</volume>
          , Cur-
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          <source>the 18th Linguistic Annotation Workshop</source>
          (LAW- ran
          <string-name>
            <surname>Associates</surname>
          </string-name>
          , Inc.,
          <year>2017</year>
          . URL: https://proceedings.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          <article-title>XVIII), Association for Computational Linguistics, neurips</article-title>
          .cc/paper_files/paper/2017/hash/
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          <string-name>
            <surname>St. Julians</surname>
          </string-name>
          , Malta,
          <year>2024</year>
          , pp.
          <fpage>112</fpage>
          -
          <lpage>123</lpage>
          . URL:
          <article-title>https: 3f5ee243547dee91fbd053c1c4a845aa-Abstract.</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          //aclanthology.org/
          <year>2024</year>
          .law-
          <volume>1</volume>
          .11/. html. [8]
          <string-name>
            <given-names>L.</given-names>
            <surname>Kohnke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. L.</given-names>
            <surname>Moorhouse</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Zou</surname>
          </string-name>
          , Chatgpt for [20]
          <string-name>
            <given-names>A.</given-names>
            <surname>Dubey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Jauhri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Pandey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kadian</surname>
          </string-name>
          , The
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          <article-title>language teaching and learning</article-title>
          ,
          <source>RELC Journal 54 Llama 3 Herd of Models</source>
          ,
          <year>2024</year>
          . URL: http://arxiv.
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          (
          <year>2023</year>
          ). doi:
          <volume>10</volume>
          .1177/00336882231204379. org/abs/2407.21783. [9]
          <string-name>
            <given-names>G.</given-names>
            <surname>Gilquin</surname>
          </string-name>
          , From design to collection of learner [21]
          <string-name>
            <given-names>J.</given-names>
            <surname>Ainslie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lee-Thorp</surname>
          </string-name>
          , M. d. Jong, Y. Zemlyanskiy,
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          <string-name>
            <surname>Research</surname>
          </string-name>
          , Cambridge University Press,
          <year>2015</year>
          , pp.
          <fpage>9</fpage>
          -
          <lpage>Checkpoints</lpage>
          ,
          <year>2023</year>
          . URL: http://arxiv.org/abs/2305.
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          34. doi:
          <volume>10</volume>
          .1017/CBO9781139649414.002. 13245. doi:
          <volume>10</volume>
          .48550/arXiv.2305.13245. [10]
          <string-name>
            <given-names>N.</given-names>
            <surname>Nesselhauf</surname>
          </string-name>
          ,
          <article-title>Learner corpora</article-title>
          and their potential [22]
          <string-name>
            <given-names>N.</given-names>
            <surname>Shazeer</surname>
          </string-name>
          , GLU Variants Improve Transformer,
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          <article-title>for language teaching</article-title>
          , in: J. M.
          <string-name>
            <surname>Sinclair</surname>
          </string-name>
          (Ed.),
          <year>How 2020</year>
          . URL: http://arxiv.org/abs/
          <year>2002</year>
          .05202. doi:10.
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          <article-title>to Use Corpora in Language Teaching</article-title>
          , John Ben-
          <volume>48550</volume>
          /arXiv.
          <year>2002</year>
          .
          <volume>05202</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          <string-name>
            <surname>jamins</surname>
          </string-name>
          ,
          <year>2004</year>
          , pp.
          <fpage>125</fpage>
          -
          <lpage>152</lpage>
          . doi:
          <volume>10</volume>
          .1075/scl.12. [23]
          <string-name>
            <given-names>E. J.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Wallis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Allen-Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          <string-name>
            <given-names>11nes. S.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Wang</surname>
          </string-name>
          , W. Chen, LoRA: Low-Rank [11]
          <string-name>
            <given-names>F.</given-names>
            <surname>Meunier</surname>
          </string-name>
          , Introduction to learner corpus research,
          <source>Adaptation of Large Language Models</source>
          ,
          <year>2021</year>
          . URL:
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          <string-name>
            <given-names>in: N.</given-names>
            <surname>Tracy-Ventura</surname>
          </string-name>
          ,
          <string-name>
            <surname>M.</surname>
          </string-name>
          Paquot (Eds.), The Rout- http://arxiv.org/abs/2106.09685.
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          <source>ledge Handbook of Second Language Acquisition</source>
          [24]
          <string-name>
            <given-names>D. P.</given-names>
            <surname>Kingma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ba</surname>
          </string-name>
          ,
          <article-title>Adam: A Method for Stochastic</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          <source>and Corpora</source>
          , Routledge, New York,
          <year>2020</year>
          , pp.
          <fpage>23</fpage>
          -
          <lpage>36</lpage>
          . Optimization,
          <year>2017</year>
          . URL: http://arxiv.org/abs/1412. [12]
          <string-name>
            <given-names>C.</given-names>
            <surname>Davis</surname>
          </string-name>
          , et al.,
          <string-name>
            <surname>Prompting</surname>
          </string-name>
          open-source and com-
          <volume>6980</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          <article-title>mercial language models for grammatical error</article-title>
          [25]
          <string-name>
            <given-names>I.</given-names>
            <surname>Loshchilov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Hutter</surname>
          </string-name>
          ,
          <string-name>
            <surname>Decoupled Weight</surname>
          </string-name>
          De-
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          <article-title>correction of english learner text</article-title>
          ,
          <source>arXiv</source>
          (
          <year>2024</year>
          ).
          <source>cay Regularization</source>
          ,
          <year>2019</year>
          . URL: http://arxiv.org/abs/
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>URL: https://doi.org/10.48550/ARXIV.2401.07702. 1711.05101.</mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          <source>arXiv:2401</source>
          .
          <fpage>07702</fpage>
          . [26]
          <string-name>
            <given-names>T.</given-names>
            <surname>Dettmers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Pagnoni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Holtzman</surname>
          </string-name>
          , L. Zettle-
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>