<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
<article-title>LangLearn at EVALITA 2023: Overview of the Language Learning Development Task</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Chiara Alzetta</string-name>
          <email>chiara.alzetta@ilc.cnr.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
          <xref ref-type="aff" rid="aff5">5</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dominique Brunato</string-name>
          <email>dominique.brunato@ilc.cnr.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
          <xref ref-type="aff" rid="aff5">5</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Felice Dell'Orletta</string-name>
          <email>felice.dellorletta@ilc.cnr.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
          <xref ref-type="aff" rid="aff5">5</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alessio Miaschi</string-name>
          <email>alessio.miaschi@ilc.cnr.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
          <xref ref-type="aff" rid="aff5">5</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Kenji Sagae</string-name>
          <email>sagae@ucdavis.edu</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
          <xref ref-type="aff" rid="aff5">5</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Claudia H. Sánchez-Gutiérrez</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
          <xref ref-type="aff" rid="aff5">5</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giulia Venturi</string-name>
          <email>giulia.venturi@ilc.cnr.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
          <xref ref-type="aff" rid="aff5">5</xref>
        </contrib>
        <aff id="aff0">
<label>1</label>
          <institution>ItaliaNLP Lab, CNR, Istituto di Linguistica Computazionale 'A.Zampolli'</institution>
          ,
          <addr-line>Pisa</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Processing and Speech Tools for Italian</institution>
          ,
          <addr-line>Sep 7 - 8, Parma, IT</addr-line>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>The Language Learning Development Task</institution>
          ,
          <addr-line>LangLearn</addr-line>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>The assessment of language development is cast in Lan-</institution>
        </aff>
        <aff id="aff4">
<label>2</label>
          <institution>University of California</institution>
          ,
          <addr-line>Davis</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff5">
          <label>5</label>
          <institution>written language competence of L2 Spanish and L1 Italian</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2023</year>
      </pub-date>
<abstract>
        <p>Language Learning Development (LangLearn) is the EVALITA 2023 shared task on automatic language development assessment, which consists in predicting the evolution of the written language abilities of learners across time. LangLearn is conceived to be multilingual, relying on written productions of Italian and Spanish learners, and representative of L1 and L2 learning scenarios. A total of 9 systems were submitted by 5 teams. The results highlight the open challenges of automatic language development assessment.</p>
      </abstract>
      <kwd-group>
        <kwd>language learning development</kwd>
        <kwd>student essays</kwd>
        <kwd>shared task</kwd>
        <kwd>multilingual language learning assessment</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <sec id="sec-1-1">
        <title>Over the last twenty years, there has been a growing</title>
        <p>
          interest in exploiting the potential of Natural Language
Processing (NLP) tools to characterize the properties of
both in first (L1) and second language (L2) acquisition
scenarios. A similar concern has been paid to turning
ing (ICALL) systems [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] and tools for automatically
scoring learners’ writing with respect to language proficiency
and writing quality [
          <xref ref-type="bibr" rid="ref2 ref3">2, 3</xref>
          ], and more generally systems
able to automatically assign a learner’s language
productionalize sophisticated metrics of language development
thus alleviating the laborious manual computation of
these metrics by experts [
          <xref ref-type="bibr" rid="ref6 ref7 ref8">6, 7, 8</xref>
          ].
        </p>
        <p>
          Generally, a greater number of studies has been carried
out in the field of L2 learning where the study of L2
writings is seen as a proxy for language ability development
[
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]. In this respect, much work is devoted to predicting
the degree of L2 proficiency according to expert-based
ical structures’ competence with respect to predefined
grades, e.g. the Common European Framework of
RefEVALITA 2023: 8th Evaluation Campaign of Natural Language
the same student, a document   should have a higher
tion to a given developmental level [
          <xref ref-type="bibr" rid="ref4 ref5">4, 5</xref>
          ] or to opera- learners, respectively.
evaluation [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] or to modeling the evolution of grammat- and Spanish learners, and representative of L1 and L2
        </p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>3. Datasets</title>
      <p>quality level with respect to the ones written previously
(  ). Specifically, we followed the approach devised by
[19]: given a randomly ordered pair of essays ( 1,  2)
written by the same student, we ask to predict whether  2
was written before  1.</p>
      <p>LangLearn was articulated in two sub-tasks based on
the resources allowed for training the models.</p>
      <p>In line with the aim of having a multilingual shared task,
we distributed two datasets composed of essays written
by learners of the Italian and Spanish languages.
Notably, the two datasets reflect an additional dimension of
variation, which is the diferent learning scenarios from
which the written productions were obtained.
Specifically, the collection of Italian essays was written by
students learning Italian as their first language, while the
Spanish essays were produced by L2 learners.</p>
      <p>For each corpus, LangLearn participants were provided
with two files:
i.e. reflexive, narrative, descriptive, expository and
argumentative.</p>
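      <p>For illustration, a minimal Python sketch of how the two files could be loaded and joined follows; the file names and the .tsv column labels are our assumptions, not the official ones.</p>
      <preformat>import csv
import xml.etree.ElementTree as ET

# Map randomly generated document IDs to essay texts from the XML file.
docs = {
    doc.attrib["id"]: (doc.text or "").strip()
    for doc in ET.parse("langlearn_train.xml").getroot().iter("doc")
}

# Each .tsv row describes a pair (d1, d2); column names here are hypothetical.
with open("langlearn_train.tsv", newline="", encoding="utf-8") as f:
    pairs = [
        (docs[row["id_d1"]], docs[row["id_d2"]], row["t1"], row["t2"])
        for row in csv.DictReader(f, delimiter="\t")
    ]</preformat>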
      <sec id="sec-2-1">
        <title>3.1. Corpus Italiano di Apprendenti L1 (CItA)</title>
        <p>CItA (Corpus Italiano di Apprendenti L1) [21] is a longitudinal corpus of essays written by the same L1 Italian students in the first (2012-2013) and second year (2013-2014) of lower secondary school. The original corpus contains a total of 1,352 essays written by 156 students. The essays belong to five textual typologies, which reflect the different prompts students were asked to respond to, i.e. reflexive, narrative, descriptive, expository and argumentative.</p>
        <p>For the purposes of the LangLearn shared task, we selected a subset of 882 essays authored by 133 different students at different time intervals. A time interval is identified by the year and specific period during which each essay was produced (e.g., the label 1_4 denotes the fourth essay written during the first year). Specifically, we considered 11 intervals, six for the first year and five for the second one. As can be seen in Figure 1, the essays feature diverse linguistic characteristics across the considered time intervals. In fact, essays written in the first year tend to be shorter in terms of the total number of tokens than those produced in the second year. Interestingly, the length of the document is a raw text feature highly related to various linguistic aspects that shape the writing style of an essay. Furthermore, the essays are increasingly lexically richer across time, as emerged from the type/token ratio (TTR) values calculated for the first 100 tokens of the texts (note that these two linguistic characteristics, i.e. document length and TTR, are those used to compute the baseline scores). It is worth noting that the last essays of the second year (interval 11) deviate from this trend. This is possibly due to the fact that they are mostly related to similar prompts that involved completing a story; in this case, students tend to write shorter and less lexically varied essays.</p>
        <p>[Figure 1: Linguistic characteristics (total number of tokens and type/token ratio of the first 100 tokens) of the CItA essays across the time intervals.]</p>
        <p>In order to build the training and test sets of LangLearn, essays from each student were paired based on their chronological order of writing, ensuring that the first essay in each pair was written prior to the second. This process resulted in 2,673 essay pairs: 2,366 were assigned to the train set, and the remaining 307 were placed in the test set. The distribution of pairs across time intervals is reported in Table 1. Note that some time interval pairs (e.g. 1_2 − 1_5) appear only in the test set; this is done to challenge participants, since they do not have any corresponding pairs within the train set. Similarly, we isolated 4 students whose essays appear only in the test set, while the essays of 49 students appear only in the train set and the essays of 80 students appear in both sets. Indeed, it is possible for the same essay to appear in both the training and test sets, but it would appear in different pairs, ensuring that a specific pair occurs exclusively in either the train or test set.</p>
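        <p>A plausible sketch of this pairing procedure follows; reading "paired based on their chronological order" as taking all ordered two-element combinations per student is our assumption, and the variable names are illustrative.</p>
        <preformat>from itertools import combinations

def build_pairs(essays_by_student):
    """essays_by_student: {student_id: [(interval, doc_id), ...]}"""
    pairs = []
    for student, essays in essays_by_student.items():
        ordered = sorted(essays)  # intervals like (1, 4) sort chronologically
        # combinations() preserves order, so the first essay of each pair
        # is always the one written earlier.
        for first, second in combinations(ordered, 2):
            pairs.append((student, first[1], second[1]))
    return pairs

# Example: two essays from year 1 and one from year 2 yield 3 ordered pairs.
print(build_pairs({"s01": [((1, 1), "9843"), ((1, 4), "7432"), ((2, 1), "5120")]}))</preformat>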
      <sec id="sec-2-1">
        <title>3.1. Corpus Italiano di Apprendenti L1</title>
        <p>(CItA)</p>
      </sec>
      <sec id="sec-2-2">
        <title>3.2. Corpus of Written Spanish of L2 and</title>
      </sec>
      <sec id="sec-2-3">
        <title>Heritage Speakers (COWS-L2H)</title>
        <p>CItA (Corpus Italiano di Apprendenti L1) [21] is a longi- The COWS-L2H (Corpus of Written Spanish of L2 and
Hertudinal corpus of essays written by the same L1 Italian itage Speakers) corpus [22] consists of 3,498 short essays
students in the first (2012-2013) and second year (2013- written by second language (L2) students enrolled in one
2014) of lower secondary school. The original corpus of ten lower-division Spanish courses at a single
Americontains a total of 1,352 essays written by 156 students. can university. Student compositions in the corpus are
The essays belong to five textual typologies, which reflect
the diferent prompts students were asked to respond to,</p>
        <sec id="sec-2-3-1">
          <title>1Note that these two linguistic characteristics are those used to</title>
          <p>compute the baseline scores.
        <p>To select essays from the original COWS-L2H dataset for the LangLearn task, we considered only essays written by students who wrote essays in two separate academic terms. This way, we can pair essays written at different points in time by the same student. To reduce the possibility that factors independent of language learning could systematically differentiate between essays in a pair, we considered only pairs of essays written in response to the same prompt. With these constraints, we were left with 1,329 pairs of essays written by 440 students. To split these essay pairs into training and test sets, we selected the essays written by 330 students to be in the training set, and the essays written by the remaining 110 students to be in the test set. This means that, in contrast with the CItA dataset used in LangLearn, there is no overlap in essays or authors between the training and test sets. The resulting training set contains 1,009 essay pairs, and the test set contains 320 essay pairs. The time interval between essays in a pair usually consists of one, two or three academic terms, with each term corresponding to 10 weeks of courses (Table 2). It is important to note that these intervals are not easily comparable across datasets, since COWS-L2H deals with highly structured L2 instruction, which progresses differently from L1 writing.</p>
      </sec>
    </sec>
    <sec id="sec-eval">
      <title>4. Evaluation</title>
      <p><bold>Baseline.</bold> The baseline scores were calculated by training a LinearSVM using, for each pair (d<sub>1</sub>, d<sub>2</sub>), the number of tokens per document and the type/token ratio of the first 100 tokens in each document as input features.</p>
      <p><bold>Metrics.</bold> The models' performance on the CItA and COWS-L2H test sets has been evaluated independently using Accuracy (A) and F1-score (F-score).</p>
    </sec>
    <sec id="sec-sys">
      <title>5. Submitted Systems and Participants</title>
      <p>Following a call for interest, 5 teams registered for the task and submitted their predictions for both datasets, for a total of 18 runs (namely, 9 for each language tackled in the shared task). Eventually, one team (i.e. aroyehun_angel) did not submit a system report; we therefore included their scores in the overall dashboard, but excluded them from the system description and error analyses. As shown in Table 3, all teams participated only in sub-task 1.</p>
      <p>[Table 3: Participating teams (BERT_4EVER, aroyehun_angel, bot.zen, IUSS-Nets, ExtremITA) and their number of members.]</p>
        <sec id="sec-2-3-2">
          <title>BERT_4EVER [23] proposed three diferent systems</title>
          <p>based on the base Italian BERT2 model [24]. For
finetuning the models, the team augmented the CItA and
COWS-L2H datasets by reversing essay pairs to obtain
negative examples and generating new positive examples
by constructing transitive pairs. In the first system, BERT,
BERT was fine-tuned performing simultaneous training
on the augmented CItA and COWS-L2H datasets. The
second model, Sequential, employs a novel sequential
information attention mechanism to capture the
interaction between the essays in a pair, which allows for
incorporating the attention weights derived from the
lastwritten essay in the representation of the pair relying
on the [CLS] token and using average pooling. This pair
representation is then fed into a linear classifier with a
softmax function. The third model proposed is the Merge
one, which fuses BERT and Sequential by averaging their
output probabilities.</p>
        </sec>
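      <p>The augmentation the team describes can be sketched as follows; the function below is illustrative, and the single-step transitive closure is our assumption.</p>
      <preformat>def augment(pairs):
    """pairs: list of (id_a, id_b) document IDs in correct chronological order."""
    positives = {(a, b) for a, b in pairs}
    # Reversing a correctly ordered pair yields a negative example.
    negatives = {(b, a) for a, b in pairs}
    # Chaining (a, b) and (b, c) yields the new (transitive) positive (a, c).
    transitive = {
        (a, c)
        for a, b in positives
        for b2, c in positives
        if b == b2 and a != c
    }
    examples = [(a, b, 1) for a, b in positives | transitive]
    examples += [(a, b, 0) for a, b in negatives]
    return examples</preformat>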
        <sec id="sec-2-3-3">
          <title>2https://huggingface.co/dbmdz/bert-base-italian-uncased</title>
          <p>bot.zen [25] tackled LangLearn as a regression
problem, where the goal was to determine the stage of the
learning process at which a student wrote a text. To
achieve this, the team first pre-processed the oficial
training sets in order to acquire the absolute order of each
essay written by a student. Then, they performed
predictions relying on an ensemble of decision tree algorithms.
The model was trained using 125 normalised features
capturing lexical and morpho-syntactic properties for
each essay. By using MALT-IT2 [26], the team was able
to include a set of features measuring text complexity
in terms of document length, and lexical, syntactic, and
morpho-syntactic properties. These features, however,
are available only for the Italian language, thus they were
used only for CItA predictions.
BERT_4EVER
aroyehun_angel
bot.zen
IUSSnets
ExtremITA</p>
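      <p>This regression formulation can be sketched as follows, with a random forest regressor standing in for the unspecified ensemble of decision trees and random vectors standing in for the 125 features.</p>
      <preformat>import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
train_vectors = rng.random((40, 125))            # stand-in feature vectors
train_positions = rng.integers(1, 12, size=40)   # stand-in absolute positions

# Predict each essay's absolute position in the learning sequence.
reg = RandomForestRegressor(n_estimators=200, random_state=0)
reg.fit(train_vectors, train_positions)

def order_pair(vec_d1, vec_d2):
    # A pair is ordered by comparing the two predicted positions.
    pos1, pos2 = reg.predict([vec_d1, vec_d2])
    return "d2_first" if pos2 &lt; pos1 else "d1_first"

print(order_pair(train_vectors[0], train_vectors[1]))</preformat>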
      <p>IUSS-Nets [27] approached LangLearn using linguistic features (e.g. density of various part-of-speech categories, frequency of different kinds of syntactic constituents, mean sentence length, etc.) extracted using the existing Common Text Analysis Platform, or CTAP [28], and surprisal-based metrics derived from token probabilities obtained using pretrained language-specific BERT models. These different pieces of information were encoded in features used in random forest classifiers. Interestingly, unlike most systems in LangLearn, which obtained better performance on the CItA dataset than on COWS-L2H, this approach produced higher accuracy and F-score on COWS-L2H. In fact, it produced the strongest results on the COWS-L2H dataset among those submitted. Although its performance on CItA was not among the strongest submitted, it was still substantially above the baseline.</p>
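      <p>A minimal sketch of one way to derive such surprisal metrics from a masked language model follows; the mask-one-token-at-a-time strategy, the mean pooling, and the specific checkpoint are our assumptions rather than details reported by the team.</p>
      <preformat>import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("dbmdz/bert-base-italian-uncased")
mlm = AutoModelForMaskedLM.from_pretrained("dbmdz/bert-base-italian-uncased")

def mean_surprisal(text):
    # Mask each token in turn and take the negative log-probability the
    # model assigns to the original token at that position.
    ids = tok(text, return_tensors="pt", truncation=True)["input_ids"][0]
    total = 0.0
    for i in range(1, len(ids) - 1):  # skip [CLS] and [SEP]
        masked = ids.clone()
        masked[i] = tok.mask_token_id
        with torch.no_grad():
            logits = mlm(masked.unsqueeze(0)).logits[0, i]
        logprobs = torch.log_softmax(logits, dim=-1)
        total += -logprobs[ids[i]].item()
    return total / max(len(ids) - 2, 1)</preformat>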
        <sec id="sec-2-3-4">
          <title>ExtremITA [29] team participated in the task with two</title>
          <p>Language Models trained in a multi-task learning
framework. The first model is an encoder-decoder based on
IT5-small [30], while the second model was a decoder
based on Camoscio [31], the Italian version of LLaMA Table 4
[32]. These models show substantial diferences in terms LangLearn shared task leaderboard.
of parameter count, with IT5-small comprising around
110 million parameters, whereas the utilized version of
Camoscio encompasses 7 billion parameters. Both
models underwent joint fine-tuning on all EVALITA 2023 the dataset, inverted sentence pairs were incorporated,
tasks and sub-tasks, leveraging prompting techniques. resulting in an expansion of the dataset from 3,377 to
Specifically, for the LangLearn task, the extremIT5 model 6,438 examples.
received each instance of the dataset with the task name
preceding it as input, and it produced the predicted la- 6. Results
bel as output. Conversely, the extremITLLaMa model,
which requires a structured prompt, was provided with Table 4 reports the leaderboard of systems
participata textual description of the task and the desired output ing in the LangLearn shared task. Most systems
outperformat specification, as follows: “Questi due testi sepa- formed the baseline when tested on CItA dataset while
rati da [SEP] sono presentati nell’ordine in cui sono scritti? surpassing the baseline proved to be more challenging on
Rispondi sì o no”. As regards the dataset treatment, some COWS-L2H dataset. The team BERT_4EVER submitted
preprocessing steps were adopted: firstly, the dataset was the best-performing systems in the L1 scenario, while
segmented into sentences, allowing a maximum of 100 the highest score for the Spanish dataset was achieved
tokens per sentence. Additionally, in order to augment by the IUSS-Nets team. ExtremITA obtained the lowest
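      <p>For illustration, this kind of prompt could be assembled as follows; everything beyond the quoted question is our assumption.</p>
      <preformat>def build_prompt(text_1, text_2):
    # Quoted question from the team's description; the surrounding
    # template is hypothetical.
    question = (
        "Questi due testi separati da [SEP] sono presentati "
        "nell'ordine in cui sono scritti? Rispondi sì o no"
    )
    return f"{question}\n\n{text_1} [SEP] {text_2}"

print(build_prompt("Primo tema...", "Secondo tema..."))</preformat>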
    </sec>
    <sec id="sec-results">
      <title>6. Results</title>
      <p>Table 4 reports the leaderboard of the systems participating in the LangLearn shared task. Most systems outperformed the baseline when tested on the CItA dataset, while surpassing the baseline proved to be more challenging on the COWS-L2H dataset. The team BERT_4EVER submitted the best-performing systems in the L1 scenario, while the highest score for the Spanish dataset was achieved by the IUSS-Nets team. ExtremITA obtained the lowest scores on both datasets.</p>
      <p>[Table 4: LangLearn shared task leaderboard, listing, for CItA, the systems BERT_4EVER-BERT, BERT_4EVER-Merge, BERT_4EVER-Sequential, aroyehun_angel-system2, aroyehun_angel-system1, bot.zen, IUSS-Nets, ExtremITA-camoscio-lora, Baseline, and ExtremITA-it5; the scores themselves were not recoverable.]</p>
      <p>Overall, we observe varying system rankings across the two learning scenarios. We discuss such variation in more depth in the next section.</p>
    </sec>
    <sec id="sec-3">
      <title>7. Discussion</title>
      <sec id="sec-3-1">
        <title>Upon examination of system performance, we notice differences in model performance between the CItA and Table 5</title>
        <p>COWS-L2H datasets. Considering that each dataset re- Results on the CItA Test set considering only unseen students.
lfects a diferent learning scenario, this might indicate
that the challenges posed by these scenarios were distinct.</p>
        <p>One notable finding is that models leveraging stylistic duced by L2 learners may primarily serve as a measure
properties of essays, such as the IUSS-Nets model, were of their progress in acquiring these new, more complex
more efective in the L2 setting. On the other hand, teams structures. On the other hand, L1 learners may face
chalthat employed Neural Language Models achieved higher lenges from their teachers to enhance their proficiency in
results on the CItA dataset. accurately using linguistic structures they have already</p>
        <p>The observed diferences in performance might be at- acquired. As a consequence, L2 essays may exhibit more
tributed to two main factors: model architectures and significant stylistic variations as learners are faced with
specific properties of the two learning scenarios. Con- the acquisition of new language structures. In contrast,
cerning the former, we highlight, for instance, that the L1 essays over time may show a more accurate use of
BERT model used by the BERT_4EVER team was pre- already familiar linguistic structures, highlighting the
trained only on Italian texts. This choice likely con- learners’ mastery of these elements.
tributed to its lower performance on COWS-L2H, despite To deepen our analyses on the CItA dataset, we
comthe simultaneous fine-tuning on both CItA and COWS- pared the system performance on a subset of essay pairs
L2H. In fact, while BERT was the best-performing model that correspond to the most challenging prediction
sceof the BERT_4EVER team and overall on CItA, it was nario, i.e. considering pairs involving students whose
essurpassed on Spanish essays by their Sequential model, says appear only in the test set. The results on this subset
which incorporates information about the interaction are reported in Table 5. As can be noted, the system
rankbetween the essays in a pair. Similar observations can be ing remains unvaried, but the bot.zen and BERT_4EVER
made for the bot.zen and IUSS-Nets teams. Both teams systems sufer a drop in their performance on this
setemployed classification models that leverage a set of ex- ting. The main cause of the decline in scores is due to
plicit features capturing linguistic properties of the texts. the increased complexity of this particular setting. In
While both teams exploited features measuring raw text fact, systems cannot rely on information extracted from
properties and the distribution of part-of-speech and syn- essays present both in the training and test sets, although
tactic dependencies for both languages, they difered in paired with diferent essays. As a result, the systems must
terms of features that captured deeper textual properties. rely solely on their generalization abilities to discern
sigSpecifically, IUSS-Nets achieved the highest score on the nificant variations within each essay pair. However, it
COWS-L2H dataset thanks to a wide set of features mea- is important to acknowledge that even in this particular
suring text complexity, sophistication, refinement, lexical setting, the scores achieved by the BERT_4EVER team
variety, and cohesion. Conversely, the bot.zen team was significantly surpass the baseline. This further highlights
unable to compute features capturing text complexity for the potential of language models, particularly in the L1
Spanish, resulting in lower scores for that language. classification scenario, as previously mentioned.</p>
        <p>These results reflect also specific properties of the two As a final remark, it is worth discussing the
perforlearning scenarios of LangLearn, which clearly afected mance of ExtremITA systems. This team employed two
all systems submitted to the shared task. As observed Large Language Models to tackle all shared tasks
proby [27], the evolution of writing abilities in a second lan- posed in the EVALITA 2023 campaign and explored the
guage shows greater variation in terms of style within applicability of a single model in solving multiple
difera shorter time period compared to a first language. We ent tasks. Although extremITLLaMA achieved the top
can assume that during the learning phase of an L2, new position in 41% of all EVALITA sub-tasks (i.e., 13 out of 22
linguistic structures are acquired by the students in a sub-tasks) and a top-three placement in 14 sub-tasks, the
highly structured schedule dictated by the L2 learning results on LangLearn were just slightly above the
baseenvironment, gradually becoming more complex in a line on CItA and below the baseline on COWS-L2H. Such
somewhat uniform way. Consequently, the essays pro- a result lays the foundation for an interesting and highly
timely discussion on the efectiveness of these large and
powerful models on real-world tasks. It appears, in fact,
that tasks that are strongly afected by stylistic
properties, such as language learning development assessment,
still pose challenges to these models.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>8. Conclusions</title>
      <p>In this report, we introduced LangLearn, the first shared task dedicated to the development of systems able to automatically predict the development of language learning starting from learners' essays, in two learning scenarios and in a multilingual setting. Analysis of the results from the 9 submitted models indicates that the task of language learning development assessment continues to present numerous unresolved challenges. Notably, models that relied on explicit stylistic features demonstrated superior performance in the Spanish L2 learning scenario, while Large Language Models showcased greater effectiveness in the Italian L1 learning scenario.</p>
      <p>These findings shed light on the complex nature of language learning assessment and suggest possible directions for future evaluation campaigns. On the one hand, by leveraging insights from the LangLearn task, researchers can devise new approaches that incorporate both explicit stylistic features and the strengths of Large Language Models. On the other hand, the comparably lower scores achieved by ExtremITA in our task seem to prompt a new typology of evaluation campaigns devoted to putting the potential of Large Language Models under pressure, pushing the boundaries of their language comprehension and generation capabilities.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <sec id="sec-5-1">
        <title>The authors gratefully acknowledge the support of the PNRR MUR project PE0000013-FAIR.</title>
        <p>[18] A. Miaschi, S. Davidson, D. Brunato, F. Dell’Or- Speech Tools for Italian. Final Workshop (EVALITA
letta, K. Sagae, C. H. Sanchez-Gutierrez, G. Venturi, 2023), CEUR.org, September 7th-8th 2023, Parma,
Tracking the evolution of written language compe- 2023.
tence in L2 Spanish learners, in: Proceedings of [28] X. Chen, D. Meurers, CTAP: A web-based tool
supBEA, ACL, Online, 2020, pp. 92–101. porting automatic complexity analysis, in:
Proceed[19] A. Miaschi, D. Brunato, F. Dell’Orletta, A nlp-based ings of the Workshop on Computational Linguistics
stylometric approach for tracking the evolution of for Linguistic Complexity (CL4LC), The COLING
l1 written language competence, Journal of Writing 2016 Organizing Committee, Osaka, Japan, 2016,
Research 13 (2021) 71–105. pp. 113–119.
[20] M. Lai, S. Menini, M. Polignano, V. Russo, R. Sprug- [29] C. D. Hromei, D. Croce, V. Basile, R. Basili,
Exnoli, G. Venturi, Evalita 2023: Overview of the 8th tremITA at EVALITA 2023: Multi-task sustainable
evaluation campaign of natural language process- scaling to large language models at its extreme, in:
ing and speech tools for italian, in: Proceedings M. Lai, S. Menini, M. Polignano, V. Russo, R.
Sprugof the Eighth Evaluation Campaign of Natural Lan- noli, G. Venturi (Eds.), Proceedings of the Eighth
guage Processing and Speech Tools for Italian. Final Evaluation Campaign of Natural Language
ProWorkshop (EVALITA 2023), CEUR.org, Parma, Italy, cessing and Speech Tools for Italian. Final
Work2023. shop (EVALITA 2023), CEUR.org, September
7th[21] A. Barbagli, Quanto e come si impara a scrivere nel 8th 2023, Parma, 2023.</p>
        <p>corso del primo biennio della scuola secondaria di [30] G. Sarti, M. Nissim, IT5: Large-scale text-to-text
primo grado, Nuova Cultura, 2016. pretraining for italian language understanding and
[22] S. Davidson, A. Yamada, P. Fernandez Mira, generation, ArXiv preprint 2203.03759 (2022). URL:
A. Carando, C. H. Sanchez Gutierrez, K. Sagae, De- https://arxiv.org/abs/2203.03759.
veloping NLP tools with a new corpus of learner [31] A. Santilli, Camoscio: An italian instruction-tuned
spanish, in: Proceedings of the 12th LRE Confer- llama, https://github.com/teelinsan/camoscio, 2023.
ence, ELRA, Marseille, France, 2020, pp. 7240–7245. [32] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A.
[23] H. Wu, N. Lin, S. Jiang, L. Xiao, BERT_4EVER Lachaux, T. Lacroix, B. Rozière, N. Goyal, E.
Hamat LangLearn: Language development assessment bro, F. Azhar, A. Rodriguez, A. Joulin, E. Grave,
model based on sequential information attention G. Lample, Llama: Open and eficient foundation
mechanism, in: M. Lai, S. Menini, M. Polignano, language models, 2023. arXiv:2302.13971.
V. Russo, R. Sprugnoli, G. Venturi (Eds.),
Proceedings of the Eighth Evaluation Campaign of
Natural Language Processing and Speech Tools for
Italian. Final Workshop (EVALITA 2023), CEUR.org,</p>
        <p>September 7th-8th 2023, Parma, 2023.
[24] T. Wolf, L. Debut, V. Sanh, alii, Transformers:
Stateof-the-art natural language processing, in: Proc. of</p>
        <p>EMNLP, ACL, Online, 2020, pp. 38–45.
[25] E. W. Stemle, M. Tebaldini, F. Bonanni, F. Pellegrino,</p>
        <p>P. Brasolin, G. H. Franzini, J.-C. Frey, O. Lopopolo,
S. Spina, bot.zen at LangLearn: regressing towards
interpretability, in: M. Lai, S. Menini, M. Polignano,
V. Russo, R. Sprugnoli, G. Venturi (Eds.),
Proceedings of the Eighth Evaluation Campaign of
Natural Language Processing and Speech Tools for
Italian. Final Workshop (EVALITA 2023), CEUR.org,</p>
        <p>September 7th-8th 2023, Parma, 2023.
[26] V. Santucci, F. Santarelli, L. Forti, S. Spina,
Automatic classification of text complexity, Applied</p>
        <p>Sciences 10 (2020) 7285.
[27] M. Barbini, E. Zanoli, C. Chesi, IUSS-Nets at
LangLearn: The role of morphosyntactic features in
language development assessment, in: M. Lai,
S. Menini, M. Polignano, V. Russo, R. Sprugnoli,
G. Venturi (Eds.), Proceedings of the Eighth
Evaluation Campaign of Natural Language Processing and</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>S.</given-names>
            <surname>Granger</surname>
          </string-name>
          ,
          <article-title>Error-tagged learner corpora and call: A promising synergy</article-title>
          ,
          <source>CALICO journal</source>
          (
          <year>2003</year>
          )
          <fpage>465</fpage>
          -
          <lpage>480</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>D. S.</given-names>
            <surname>McNamara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. A.</given-names>
            <surname>Crossley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. D.</given-names>
            <surname>Roscoe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. K.</given-names>
            <surname>Allen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Dai</surname>
          </string-name>
          ,
          <article-title>A hierarchical classification approach to automated essay scoring</article-title>
          ,
          <source>Assessing Writing</source>
          <volume>23</volume>
          (
          <year>2015</year>
          )
          <fpage>35</fpage>
          -
          <lpage>59</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>P.</given-names>
            <surname>Deane</surname>
          </string-name>
          , T. Quinlan,
          <article-title>What automated analyses of corpora can tell us about students' writing skills</article-title>
          ,
          <source>Journal of Writing Research</source>
          <volume>2</volume>
          (
          <year>2010</year>
          )
          <fpage>151</fpage>
          -
          <lpage>177</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>K.</given-names>
            <surname>Sagae</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lavie</surname>
          </string-name>
          ,
          <string-name>
<given-names>B.</given-names>
            <surname>MacWhinney</surname>
          </string-name>
          ,
          <article-title>Automatic measurement of syntactic development in child language</article-title>
          ,
          <source>in: Proceedings of the ACL, ACL</source>
          ,
          <year>2005</year>
          , pp.
          <fpage>197</fpage>
          -
          <lpage>204</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>X.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <article-title>Automatic measurement of syntactic complexity in child language acquisition</article-title>
          ,
          <source>International Journal of Corpus Linguistics</source>
          <volume>14</volume>
          (
          <year>2009</year>
          )
          <fpage>3</fpage>
          -
          <lpage>28</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>B.</given-names>
            <surname>Bram</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Housen</surname>
          </string-name>
          ,
          <article-title>Conceptualizing and measuring short-term changes in l2 writing complexity</article-title>
          ,
          <source>Journal of Second Language Writing</source>
          <volume>26</volume>
          (
          <year>2014</year>
          )
          <fpage>42</fpage>
          -
          <lpage>65</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>S. A.</given-names>
            <surname>Crossley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>McNamara</surname>
          </string-name>
          ,
          <article-title>Does writing development equal writing quality? A computational investigation of syntactic complexity in L2 learners</article-title>
          ,
          <source>Journal of Second Language Writing</source>
          <volume>26</volume>
          (
          <year>2014</year>
          )
          <fpage>66</fpage>
          -
          <lpage>79</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>S.</given-names>
            <surname>Lubetich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Sagae</surname>
          </string-name>
          ,
          <article-title>Data-driven measurement of child language development with simple syntactic templates</article-title>
          ,
          <source>in: Proceedings of COLING: Technical Papers</source>
          ,
          <year>2014</year>
          , pp.
          <fpage>2151</fpage>
          -
          <lpage>2160</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>S.</given-names>
            <surname>Crossley</surname>
          </string-name>
          ,
          <article-title>Linguistic features in writing quality and development: An overview</article-title>
          ,
          <source>Journal of Writing Research</source>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>S. A.</given-names>
            <surname>Crossley</surname>
          </string-name>
          ,
          <string-name>
<given-names>D. S.</given-names>
            <surname>McNamara</surname>
          </string-name>
          ,
          <article-title>Predicting second language writing proficiency: The roles of cohesion and linguistic sophistication</article-title>
          ,
          <source>Journal of Research in Reading 35</source>
          (
          <year>2012</year>
          )
          <fpage>115</fpage>
          -
          <lpage>135</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>S.</given-names>
            <surname>Vajjala</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Loo</surname>
          </string-name>
          ,
          <article-title>Automatic CEFR level prediction for Estonian learner text</article-title>
          ,
          <source>in: Proceedings of the third workshop on NLP for computer-assisted language learning</source>
          ,
          <year>2014</year>
          , pp.
          <fpage>113</fpage>
          -
          <lpage>127</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>E.</given-names>
            <surname>Volodina</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Pilán</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Alfter</surname>
          </string-name>
          , et al.,
<article-title>Classification of Swedish learner essays by CEFR levels</article-title>
          , in:
          <source>CALL communities and culture - short papers from EUROCALL</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>456</fpage>
          -
          <lpage>461</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>L.</given-names>
            <surname>Zilio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Wilkens</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Fairon</surname>
          </string-name>
          ,
          <article-title>An SLA corpus annotated with pedagogically relevant grammatical structures</article-title>
          ,
          <source>in: Proceedings of LREC, European Language Resources Association (ELRA)</source>
          , Miyazaki, Japan,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>K.</given-names>
            <surname>Sagae</surname>
          </string-name>
          ,
          <article-title>Tracking child language development with neural network language models</article-title>
          ,
          <source>Frontiers in Psychology</source>
          <volume>12</volume>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
[15]
          <string-name>
            <given-names>S. A.</given-names>
            <surname>Crossley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Weston</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>McLain Sullivan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. S.</given-names>
            <surname>McNamara</surname>
          </string-name>
          ,
          <article-title>The development of writing proficiency as a function of grade level: A linguistic analysis</article-title>
          ,
          <source>Written Communication</source>
          <volume>28</volume>
          (
          <year>2011</year>
          )
          <fpage>282</fpage>
          -
          <lpage>311</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Weiss</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Meurers</surname>
          </string-name>
          ,
          <article-title>Analyzing linguistic complexity and accuracy in academic language development of german across elementary and secondary school</article-title>
          ,
          <source>in: Proceedings of BEA</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>380</fpage>
          -
          <lpage>393</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>E.</given-names>
            <surname>Kerz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Qiao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Wiechmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ströbel</surname>
          </string-name>
          ,
          <article-title>Becoming linguistically mature: Modeling English and German children's writing development across school grades</article-title>
          ,
          <source>in: Proceedings of BEA, ACL, Online</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>65</fpage>
          -
          <lpage>74</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>[18] A. Miaschi, S. Davidson, D. Brunato, F. Dell'Orletta, K. Sagae, C. H. Sanchez-Gutierrez, G. Venturi, Tracking the evolution of written language competence in L2 Spanish learners, in: Proceedings of BEA, ACL, Online, 2020, pp. 92-101.</mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>[19] A. Miaschi, D. Brunato, F. Dell'Orletta, An NLP-based stylometric approach for tracking the evolution of L1 written language competence, Journal of Writing Research 13 (2021) 71-105.</mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>[20] M. Lai, S. Menini, M. Polignano, V. Russo, R. Sprugnoli, G. Venturi, EVALITA 2023: Overview of the 8th evaluation campaign of natural language processing and speech tools for Italian, in: Proceedings of the Eighth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2023), CEUR.org, Parma, Italy, 2023.</mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>[21] A. Barbagli, Quanto e come si impara a scrivere nel corso del primo biennio della scuola secondaria di primo grado, Nuova Cultura, 2016.</mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>[22] S. Davidson, A. Yamada, P. Fernandez Mira, A. Carando, C. H. Sanchez Gutierrez, K. Sagae, Developing NLP tools with a new corpus of learner Spanish, in: Proceedings of the 12th LREC Conference, ELRA, Marseille, France, 2020, pp. 7240-7245.</mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>[23] H. Wu, N. Lin, S. Jiang, L. Xiao, BERT_4EVER at LangLearn: Language development assessment model based on sequential information attention mechanism, in: M. Lai, S. Menini, M. Polignano, V. Russo, R. Sprugnoli, G. Venturi (Eds.), Proceedings of the Eighth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2023), CEUR.org, September 7th-8th 2023, Parma, 2023.</mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>[24] T. Wolf, L. Debut, V. Sanh, et al., Transformers: State-of-the-art natural language processing, in: Proceedings of EMNLP, ACL, Online, 2020, pp. 38-45.</mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>[25] E. W. Stemle, M. Tebaldini, F. Bonanni, F. Pellegrino, P. Brasolin, G. H. Franzini, J.-C. Frey, O. Lopopolo, S. Spina, bot.zen at LangLearn: Regressing towards interpretability, in: M. Lai, S. Menini, M. Polignano, V. Russo, R. Sprugnoli, G. Venturi (Eds.), Proceedings of the Eighth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2023), CEUR.org, September 7th-8th 2023, Parma, 2023.</mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>[26] V. Santucci, F. Santarelli, L. Forti, S. Spina, Automatic classification of text complexity, Applied Sciences 10 (2020) 7285.</mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>[27] M. Barbini, E. Zanoli, C. Chesi, IUSS-Nets at LangLearn: The role of morphosyntactic features in language development assessment, in: M. Lai, S. Menini, M. Polignano, V. Russo, R. Sprugnoli, G. Venturi (Eds.), Proceedings of the Eighth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2023), CEUR.org, September 7th-8th 2023, Parma, 2023.</mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>[28] X. Chen, D. Meurers, CTAP: A web-based tool supporting automatic complexity analysis, in: Proceedings of the Workshop on Computational Linguistics for Linguistic Complexity (CL4LC), The COLING 2016 Organizing Committee, Osaka, Japan, 2016, pp. 113-119.</mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>[29] C. D. Hromei, D. Croce, V. Basile, R. Basili, ExtremITA at EVALITA 2023: Multi-task sustainable scaling to large language models at its extreme, in: M. Lai, S. Menini, M. Polignano, V. Russo, R. Sprugnoli, G. Venturi (Eds.), Proceedings of the Eighth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2023), CEUR.org, September 7th-8th 2023, Parma, 2023.</mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>[30] G. Sarti, M. Nissim, IT5: Large-scale text-to-text pretraining for Italian language understanding and generation, ArXiv preprint 2203.03759 (2022). URL: https://arxiv.org/abs/2203.03759.</mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>[31] A. Santilli, Camoscio: An Italian instruction-tuned LLaMA, https://github.com/teelinsan/camoscio, 2023.</mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>[32] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, A. Rodriguez, A. Joulin, E. Grave, G. Lample, LLaMA: Open and efficient foundation language models, 2023. arXiv:2302.13971.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>