<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
<article-title>LangLearn at EVALITA 2023: Overview of the Language Learning Development Task</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Chiara Alzetta</string-name>
          <email>chiara.alzetta@ilc.cnr.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
          <xref ref-type="aff" rid="aff5">5</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dominique Brunato</string-name>
          <email>dominique.brunato@ilc.cnr.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
          <xref ref-type="aff" rid="aff5">5</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Felice Dell'Orletta</string-name>
          <email>felice.dellorletta@ilc.cnr.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
          <xref ref-type="aff" rid="aff5">5</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alessio Miaschi</string-name>
          <email>alessio.miaschi@ilc.cnr.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
          <xref ref-type="aff" rid="aff5">5</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Kenji Sagae</string-name>
          <email>sagae@ucdavis.edu</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
          <xref ref-type="aff" rid="aff5">5</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Claudia H. Sánchez-Gutiérrez</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
          <xref ref-type="aff" rid="aff5">5</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giulia Venturi</string-name>
          <email>giulia.venturi@ilc.cnr.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
          <xref ref-type="aff" rid="aff5">5</xref>
        </contrib>
        <aff id="aff0">
<label>1</label>
          <institution>ItaliaNLP Lab, CNR, Istituto di Linguistica Computazionale 'A.Zampolli'</institution>
          ,
          <addr-line>Pisa</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Processing and Speech Tools for Italian</institution>
          ,
          <addr-line>Sep 7 - 8, Parma, IT</addr-line>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>The Language Learning Development Task</institution>
          ,
          <addr-line>LangLearn</addr-line>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>The assessment of language development is cast in Lan-</institution>
        </aff>
        <aff id="aff4">
<label>2</label>
          <institution>University of California</institution>
          ,
          <addr-line>Davis</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff5">
          <label>5</label>
          <institution>written language competence of L2 Spanish and L1 Italian</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2023</year>
      </pub-date>
<abstract>
        <p>Language Learning Development (LangLearn) is the EVALITA 2023 shared task on automatic language development assessment, which consists in predicting the evolution of the written language abilities of learners across time. LangLearn is conceived to be multilingual, relying on written productions of Italian and Spanish learners, and representative of L1 and L2 learning scenarios. A total of 9 systems were submitted by 5 teams. The results highlight the open challenges of automatic language development assessment.</p>
      </abstract>
      <kwd-group>
        <kwd>language learning development</kwd>
        <kwd>student essays</kwd>
        <kwd>shared task</kwd>
        <kwd>multilingual language learning assessment</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <sec id="sec-1-1">
        <title>Over the last twenty years, there has been a growing</title>
        <p>
          interest in exploiting the potential of Natural Language
Processing (NLP) tools to characterize the properties of
both in first (L1) and second language (L2) acquisition
scenarios. A similar concern has been paid to turning
ing (ICALL) systems [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] and tools for automatically
scoring learners’ writing with respect to language proficiency
and writing quality [
          <xref ref-type="bibr" rid="ref2 ref3">2, 3</xref>
          ], and more generally systems
able to automatically assign a learner’s language
productionalize sophisticated metrics of language development
thus alleviating the laborious manual computation of
these metrics by experts [
          <xref ref-type="bibr" rid="ref6 ref7 ref8">6, 7, 8</xref>
          ].
        </p>
        <p>
          Generally, a greater number of studies has been carried
out in the field of L2 learning where the study of L2
writings is seen as a proxy for language ability development
[
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]. In this respect, much work is devoted to predicting
the degree of L2 proficiency according to expert-based
ical structures’ competence with respect to predefined
grades, e.g. the Common European Framework of
RefEVALITA 2023: 8th Evaluation Campaign of Natural Language
the same student, a document   should have a higher
tion to a given developmental level [
          <xref ref-type="bibr" rid="ref4 ref5">4, 5</xref>
          ] or to opera- learners, respectively.
evaluation [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] or to modeling the evolution of grammat- and Spanish learners, and representative of L1 and L2
        </p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>3. Datasets</title>
      <p>quality level with respect to the ones written previously
(  ). Specifically, we followed the approach devised by
[19]: given a randomly ordered pair of essays ( 1,  2)
written by the same student, we ask to predict whether  2
was written before  1.</p>
      <p>LangLearn was articulated in two sub-tasks based on
the resources allowed for training the models.</p>
      <p>In line with the aim of having a multilingual shared task,
we distributed two datasets composed of essays written
by learners of the Italian and Spanish languages.
Notably, the two datasets reflect an additional dimension of
variation, which is the diferent learning scenarios from
which the written productions were obtained.
Specifically, the collection of Italian essays was written by
students learning Italian as their first language, while the
Spanish essays were produced by L2 learners.</p>
      <p>For each corpus, LangLearn participants were provided
with two files:
i.e. reflexive, narrative, descriptive, expository and
argumentative.</p>
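      <p>For illustration, a minimal Python sketch of how the two files could be loaded and joined follows; the file names and the .tsv column labels are our assumptions, not the official ones.</p>
      <preformat>import csv
import xml.etree.ElementTree as ET

# Map randomly generated document IDs to essay texts from the XML file.
docs = {
    doc.attrib["id"]: (doc.text or "").strip()
    for doc in ET.parse("langlearn_train.xml").getroot().iter("doc")
}

# Each .tsv row describes a pair (d1, d2); column names here are hypothetical.
with open("langlearn_train.tsv", newline="", encoding="utf-8") as f:
    pairs = [
        (docs[row["id_d1"]], docs[row["id_d2"]], row["t1"], row["t2"])
        for row in csv.DictReader(f, delimiter="\t")
    ]</preformat>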
      <sec id="sec-2-1">
        <title>3.1. Corpus Italiano di Apprendenti L1 (CItA)</title>
        <p>CItA (Corpus Italiano di Apprendenti L1) [21] is a longitudinal corpus of essays written by the same L1 Italian students in the first (2012-2013) and second year (2013-2014) of lower secondary school. The original corpus contains a total of 1,352 essays written by 156 students. The essays belong to five textual typologies, which reflect the different prompts students were asked to respond to, i.e. reflexive, narrative, descriptive, expository and argumentative.</p>
        <p>For the purposes of the LangLearn shared task, we selected a subset of 882 essays authored by 133 different students at different time intervals. A time interval is identified by the year and specific period during which each essay was produced (e.g., the label 1_4 denotes the fourth essay written during the first year). Specifically, we considered 11 intervals, six for the first year and five for the second one. As can be seen in Figure 1, the essays feature diverse linguistic characteristics across the considered time intervals. In fact, essays written in the first year tend to be shorter in terms of the total number of tokens than those produced in the second year. Interestingly, the length of the document is a raw text feature highly related to various linguistic aspects that shape the writing style of an essay. Furthermore, the essays are increasingly lexically richer across time, as emerged from the type/token ratio (TTR) values calculated for the first 100 tokens of the texts (note that these two linguistic characteristics, i.e. document length and TTR, are those used to compute the baseline scores). It is worth noting that the last essays of the second year (interval 11) deviate from this trend. This is possibly due to the fact that they are mostly related to similar prompts that involved completing a story; in this case, students tend to write shorter and less lexically varied essays.</p>
        <p>[Figure 1: Linguistic characteristics (total number of tokens and type/token ratio of the first 100 tokens) of the CItA essays across the time intervals.]</p>
        <p>In order to build the training and test sets of LangLearn, essays from each student were paired based on their chronological order of writing, ensuring that the first essay in each pair was written prior to the second. This process resulted in 2,673 essay pairs: 2,366 were assigned to the train set, and the remaining 307 were placed in the test set. The distribution of pairs across time intervals is reported in Table 1. Note that some time interval pairs (e.g. 1_2 − 1_5) appear only in the test set; this is done to challenge participants, since they do not have any corresponding pairs within the train set. Similarly, we isolated 4 students whose essays appear only in the test set, while the essays of 49 students appear only in the train set and the essays of 80 students appear in both sets. Indeed, it is possible for the same essay to appear in both the training and test sets, but it would appear in different pairs, ensuring that a specific pair occurs exclusively in either the train or test set.</p>
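        <p>A plausible sketch of this pairing procedure follows; reading "paired based on their chronological order" as taking all ordered two-element combinations per student is our assumption, and the variable names are illustrative.</p>
        <preformat>from itertools import combinations

def build_pairs(essays_by_student):
    """essays_by_student: {student_id: [(interval, doc_id), ...]}"""
    pairs = []
    for student, essays in essays_by_student.items():
        ordered = sorted(essays)  # intervals like (1, 4) sort chronologically
        # combinations() preserves order, so the first essay of each pair
        # is always the one written earlier.
        for first, second in combinations(ordered, 2):
            pairs.append((student, first[1], second[1]))
    return pairs

# Example: two essays from year 1 and one from year 2 yield 3 ordered pairs.
print(build_pairs({"s01": [((1, 1), "9843"), ((1, 4), "7432"), ((2, 1), "5120")]}))</preformat>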
      <sec id="sec-2-1">
        <title>3.1. Corpus Italiano di Apprendenti L1</title>
        <p>(CItA)</p>
      </sec>
      <sec id="sec-2-2">
        <title>3.2. Corpus of Written Spanish of L2 and</title>
      </sec>
      <sec id="sec-2-3">
        <title>Heritage Speakers (COWS-L2H)</title>
        <p>CItA (Corpus Italiano di Apprendenti L1) [21] is a longi- The COWS-L2H (Corpus of Written Spanish of L2 and
Hertudinal corpus of essays written by the same L1 Italian itage Speakers) corpus [22] consists of 3,498 short essays
students in the first (2012-2013) and second year (2013- written by second language (L2) students enrolled in one
2014) of lower secondary school. The original corpus of ten lower-division Spanish courses at a single
Americontains a total of 1,352 essays written by 156 students. can university. Student compositions in the corpus are
The essays belong to five textual typologies, which reflect
the diferent prompts students were asked to respond to,</p>
        <sec id="sec-2-3-1">
          <title>1Note that these two linguistic characteristics are those used to</title>
          <p>compute the baseline scores.
        <p>To select essays from the original COWS-L2H dataset for the LangLearn task, we considered only essays written by students who wrote essays in two separate academic terms. This way, we can pair essays written at different points in time by the same student. To reduce the possibility that factors independent of language learning could systematically differentiate between essays in a pair, we considered only pairs of essays written in response to the same prompt. With these constraints, we were left with 1,329 pairs of essays written by 440 students. To split these essay pairs into training and test sets, we selected the essays written by 330 students to be in the training set, and the essays written by the remaining 110 students to be in the test set. This means that, in contrast with the CItA dataset used in LangLearn, there is no overlap in essays or authors between the training and test sets. The resulting training set contains 1,009 essay pairs, and the test set contains 320 essay pairs. The time interval between essays in a pair usually consists of one, two or three academic terms, with each term corresponding to 10 weeks of courses (Table 2). It is important to note that these intervals are not easily comparable across datasets, since COWS-L2H deals with highly structured L2 instruction, which progresses differently from L1 writing.</p>
      </sec>
    </sec>
    <sec id="sec-eval">
      <title>4. Evaluation</title>
      <p><bold>Baseline.</bold> The baseline scores were calculated by training a LinearSVM using, for each pair (d<sub>1</sub>, d<sub>2</sub>), the number of tokens per document and the type/token ratio of the first 100 tokens in each document as input features.</p>
      <p><bold>Metrics.</bold> The models' performance on the CItA and COWS-L2H test sets has been evaluated independently using Accuracy (A) and F1-score (F-score).</p>
    </sec>
    <sec id="sec-sys">
      <title>5. Submitted Systems and Participants</title>
      <p>Following a call for interest, 5 teams registered for the task and submitted their predictions for both datasets, for a total of 18 runs (namely, 9 for each language tackled in the shared task). Eventually, one team (i.e. aroyehun_angel) did not submit a system report; we therefore included their scores in the overall dashboard, but excluded them from the system description and error analyses. As shown in Table 3, all teams participated only in sub-task 1.</p>
      <p>[Table 3: Participating teams (BERT_4EVER, aroyehun_angel, bot.zen, IUSS-Nets, ExtremITA) and their number of members.]</p>
        <sec id="sec-2-3-2">
          <title>BERT_4EVER [23] proposed three diferent systems</title>
          <p>based on the base Italian BERT2 model [24]. For
finetuning the models, the team augmented the CItA and
COWS-L2H datasets by reversing essay pairs to obtain
negative examples and generating new positive examples
by constructing transitive pairs. In the first system, BERT,
BERT was fine-tuned performing simultaneous training
on the augmented CItA and COWS-L2H datasets. The
second model, Sequential, employs a novel sequential
information attention mechanism to capture the
interaction between the essays in a pair, which allows for
incorporating the attention weights derived from the
lastwritten essay in the representation of the pair relying
on the [CLS] token and using average pooling. This pair
representation is then fed into a linear classifier with a
softmax function. The third model proposed is the Merge
one, which fuses BERT and Sequential by averaging their
output probabilities.</p>
        </sec>
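      <p>The augmentation the team describes can be sketched as follows; the function below is illustrative, and the single-step transitive closure is our assumption.</p>
      <preformat>def augment(pairs):
    """pairs: list of (id_a, id_b) document IDs in correct chronological order."""
    positives = {(a, b) for a, b in pairs}
    # Reversing a correctly ordered pair yields a negative example.
    negatives = {(b, a) for a, b in pairs}
    # Chaining (a, b) and (b, c) yields the new (transitive) positive (a, c).
    transitive = {
        (a, c)
        for a, b in positives
        for b2, c in positives
        if b == b2 and a != c
    }
    examples = [(a, b, 1) for a, b in positives | transitive]
    examples += [(a, b, 0) for a, b in negatives]
    return examples</preformat>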
        <sec id="sec-2-3-3">
          <title>2https://huggingface.co/dbmdz/bert-base-italian-uncased</title>
          <p>bot.zen [25] tackled LangLearn as a regression
problem, where the goal was to determine the stage of the
learning process at which a student wrote a text. To
achieve this, the team first pre-processed the oficial
training sets in order to acquire the absolute order of each
essay written by a student. Then, they performed
predictions relying on an ensemble of decision tree algorithms.
The model was trained using 125 normalised features
capturing lexical and morpho-syntactic properties for
each essay. By using MALT-IT2 [26], the team was able
to include a set of features measuring text complexity
in terms of document length, and lexical, syntactic, and
morpho-syntactic properties. These features, however,
are available only for the Italian language, thus they were
used only for CItA predictions.
BERT_4EVER
aroyehun_angel
bot.zen
IUSSnets
ExtremITA</p>
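      <p>This regression formulation can be sketched as follows, with a random forest regressor standing in for the unspecified ensemble of decision trees and random vectors standing in for the 125 features.</p>
      <preformat>import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
train_vectors = rng.random((40, 125))            # stand-in feature vectors
train_positions = rng.integers(1, 12, size=40)   # stand-in absolute positions

# Predict each essay's absolute position in the learning sequence.
reg = RandomForestRegressor(n_estimators=200, random_state=0)
reg.fit(train_vectors, train_positions)

def order_pair(vec_d1, vec_d2):
    # A pair is ordered by comparing the two predicted positions.
    pos1, pos2 = reg.predict([vec_d1, vec_d2])
    return "d2_first" if pos2 &lt; pos1 else "d1_first"

print(order_pair(train_vectors[0], train_vectors[1]))</preformat>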
      <p>IUSS-Nets [27] approached LangLearn using linguistic features (e.g. density of various part-of-speech categories, frequency of different kinds of syntactic constituents, mean sentence length, etc.) extracted using the existing Common Text Analysis Platform, or CTAP [28], and surprisal-based metrics derived from token probabilities obtained using pretrained language-specific BERT models. These different pieces of information were encoded in features used in random forest classifiers. Interestingly, unlike most systems in LangLearn, which obtained better performance on the CItA dataset than on COWS-L2H, this approach produced higher accuracy and F-score on COWS-L2H. In fact, it produced the strongest results on the COWS-L2H dataset among those submitted. Although its performance on CItA was not among the strongest submitted, it was still substantially above the baseline.</p>
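      <p>A minimal sketch of one way to derive such surprisal metrics from a masked language model follows; the mask-one-token-at-a-time strategy, the mean pooling, and the specific checkpoint are our assumptions rather than details reported by the team.</p>
      <preformat>import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("dbmdz/bert-base-italian-uncased")
mlm = AutoModelForMaskedLM.from_pretrained("dbmdz/bert-base-italian-uncased")

def mean_surprisal(text):
    # Mask each token in turn and take the negative log-probability the
    # model assigns to the original token at that position.
    ids = tok(text, return_tensors="pt", truncation=True)["input_ids"][0]
    total = 0.0
    for i in range(1, len(ids) - 1):  # skip [CLS] and [SEP]
        masked = ids.clone()
        masked[i] = tok.mask_token_id
        with torch.no_grad():
            logits = mlm(masked.unsqueeze(0)).logits[0, i]
        logprobs = torch.log_softmax(logits, dim=-1)
        total += -logprobs[ids[i]].item()
    return total / max(len(ids) - 2, 1)</preformat>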
        <sec id="sec-2-3-4">
          <title>ExtremITA [29] team participated in the task with two</title>
          <p>Language Models trained in a multi-task learning
framework. The first model is an encoder-decoder based on
IT5-small [30], while the second model was a decoder
based on Camoscio [31], the Italian version of LLaMA Table 4
[32]. These models show substantial diferences in terms LangLearn shared task leaderboard.
of parameter count, with IT5-small comprising around
110 million parameters, whereas the utilized version of
Camoscio encompasses 7 billion parameters. Both
models underwent joint fine-tuning on all EVALITA 2023 the dataset, inverted sentence pairs were incorporated,
tasks and sub-tasks, leveraging prompting techniques. resulting in an expansion of the dataset from 3,377 to
Specifically, for the LangLearn task, the extremIT5 model 6,438 examples.
received each instance of the dataset with the task name
preceding it as input, and it produced the predicted la- 6. Results
bel as output. Conversely, the extremITLLaMa model,
which requires a structured prompt, was provided with Table 4 reports the leaderboard of systems
participata textual description of the task and the desired output ing in the LangLearn shared task. Most systems
outperformat specification, as follows: “Questi due testi sepa- formed the baseline when tested on CItA dataset while
rati da [SEP] sono presentati nell’ordine in cui sono scritti? surpassing the baseline proved to be more challenging on
Rispondi sì o no”. As regards the dataset treatment, some COWS-L2H dataset. The team BERT_4EVER submitted
preprocessing steps were adopted: firstly, the dataset was the best-performing systems in the L1 scenario, while
segmented into sentences, allowing a maximum of 100 the highest score for the Spanish dataset was achieved
tokens per sentence. Additionally, in order to augment by the IUSS-Nets team. ExtremITA obtained the lowest
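      <p>For illustration, this kind of prompt could be assembled as follows; everything beyond the quoted question is our assumption.</p>
      <preformat>def build_prompt(text_1, text_2):
    # Quoted question from the team's description; the surrounding
    # template is hypothetical.
    question = (
        "Questi due testi separati da [SEP] sono presentati "
        "nell'ordine in cui sono scritti? Rispondi sì o no"
    )
    return f"{question}\n\n{text_1} [SEP] {text_2}"

print(build_prompt("Primo tema...", "Secondo tema..."))</preformat>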
    </sec>
    <sec id="sec-results">
      <title>6. Results</title>
      <p>Table 4 reports the leaderboard of the systems participating in the LangLearn shared task. Most systems outperformed the baseline when tested on the CItA dataset, while surpassing the baseline proved to be more challenging on the COWS-L2H dataset. The team BERT_4EVER submitted the best-performing systems in the L1 scenario, while the highest score for the Spanish dataset was achieved by the IUSS-Nets team. ExtremITA obtained the lowest scores on both datasets.</p>
      <p>[Table 4: LangLearn shared task leaderboard, listing, for CItA, the systems BERT_4EVER-BERT, BERT_4EVER-Merge, BERT_4EVER-Sequential, aroyehun_angel-system2, aroyehun_angel-system1, bot.zen, IUSS-Nets, ExtremITA-camoscio-lora, Baseline, and ExtremITA-it5; the scores themselves were not recoverable.]</p>
      <p>Overall, we observe varying system rankings across the two learning scenarios. We discuss such variation in more depth in the next section.</p>
    </sec>
    <sec id="sec-3">
      <title>7. Discussion</title>
      <sec id="sec-3-1">
        <title>Upon examination of system performance, we notice differences in model performance between the CItA and Table 5</title>
        <p>COWS-L2H datasets. Considering that each dataset re- Results on the CItA Test set considering only unseen students.
lfects a diferent learning scenario, this might indicate
that the challenges posed by these scenarios were distinct.</p>
        <p>One notable finding is that models leveraging stylistic duced by L2 learners may primarily serve as a measure
properties of essays, such as the IUSS-Nets model, were of their progress in acquiring these new, more complex
more efective in the L2 setting. On the other hand, teams structures. On the other hand, L1 learners may face
chalthat employed Neural Language Models achieved higher lenges from their teachers to enhance their proficiency in
results on the CItA dataset. accurately using linguistic structures they have already</p>
        <p>The observed diferences in performance might be at- acquired. As a consequence, L2 essays may exhibit more
tributed to two main factors: model architectures and significant stylistic variations as learners are faced with
specific properties of the two learning scenarios. Con- the acquisition of new language structures. In contrast,
cerning the former, we highlight, for instance, that the L1 essays over time may show a more accurate use of
BERT model used by the BERT_4EVER team was pre- already familiar linguistic structures, highlighting the
trained only on Italian texts. This choice likely con- learners’ mastery of these elements.
tributed to its lower performance on COWS-L2H, despite To deepen our analyses on the CItA dataset, we
comthe simultaneous fine-tuning on both CItA and COWS- pared the system performance on a subset of essay pairs
L2H. In fact, while BERT was the best-performing model that correspond to the most challenging prediction
sceof the BERT_4EVER team and overall on CItA, it was nario, i.e. considering pairs involving students whose
essurpassed on Spanish essays by their Sequential model, says appear only in the test set. The results on this subset
which incorporates information about the interaction are reported in Table 5. As can be noted, the system
rankbetween the essays in a pair. Similar observations can be ing remains unvaried, but the bot.zen and BERT_4EVER
made for the bot.zen and IUSS-Nets teams. Both teams systems sufer a drop in their performance on this
setemployed classification models that leverage a set of ex- ting. The main cause of the decline in scores is due to
plicit features capturing linguistic properties of the texts. the increased complexity of this particular setting. In
While both teams exploited features measuring raw text fact, systems cannot rely on information extracted from
properties and the distribution of part-of-speech and syn- essays present both in the training and test sets, although
tactic dependencies for both languages, they difered in paired with diferent essays. As a result, the systems must
terms of features that captured deeper textual properties. rely solely on their generalization abilities to discern
sigSpecifically, IUSS-Nets achieved the highest score on the nificant variations within each essay pair. However, it
COWS-L2H dataset thanks to a wide set of features mea- is important to acknowledge that even in this particular
suring text complexity, sophistication, refinement, lexical setting, the scores achieved by the BERT_4EVER team
variety, and cohesion. Conversely, the bot.zen team was significantly surpass the baseline. This further highlights
unable to compute features capturing text complexity for the potential of language models, particularly in the L1
Spanish, resulting in lower scores for that language. classification scenario, as previously mentioned.</p>
        <p>These results reflect also specific properties of the two As a final remark, it is worth discussing the
perforlearning scenarios of LangLearn, which clearly afected mance of ExtremITA systems. This team employed two
all systems submitted to the shared task. As observed Large Language Models to tackle all shared tasks
proby [27], the evolution of writing abilities in a second lan- posed in the EVALITA 2023 campaign and explored the
guage shows greater variation in terms of style within applicability of a single model in solving multiple
difera shorter time period compared to a first language. We ent tasks. Although extremITLLaMA achieved the top
can assume that during the learning phase of an L2, new position in 41% of all EVALITA sub-tasks (i.e., 13 out of 22
linguistic structures are acquired by the students in a sub-tasks) and a top-three placement in 14 sub-tasks, the
highly structured schedule dictated by the L2 learning results on LangLearn were just slightly above the
baseenvironment, gradually becoming more complex in a line on CItA and below the baseline on COWS-L2H. Such
somewhat uniform way. Consequently, the essays pro- a result lays the foundation for an interesting and highly
timely discussion on the efectiveness of these large and
powerful models on real-world tasks. It appears, in fact,
that tasks that are strongly afected by stylistic
properties, such as language learning development assessment,
still pose challenges to these models.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>8. Conclusions</title>
      <p>In this report, we introduced LangLearn, the first shared task dedicated to the development of systems able to automatically predict the development of language learning starting from learners' essays, in two learning scenarios and in a multilingual setting. Analysis of the results from the 9 submitted models indicates that the task of language learning development assessment continues to present numerous unresolved challenges. Notably, models that relied on explicit stylistic features demonstrated superior performance in the Spanish L2 learning scenario, while Large Language Models showcased greater effectiveness in the Italian L1 learning scenario.</p>
      <p>These findings shed light on the complex nature of language learning assessment and suggest possible directions for future evaluation campaigns. On the one hand, by leveraging insights from the LangLearn task, researchers can devise new approaches that incorporate both explicit stylistic features and the strengths of Large Language Models. On the other hand, the comparably lower scores achieved by ExtremITA in our task seem to prompt a new typology of evaluation campaigns devoted to putting the potential of Large Language Models under pressure, pushing the boundaries of their language comprehension and generation capabilities.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <sec id="sec-5-1">
        <title>The authors gratefully acknowledge the support of the PNRR MUR project PE0000013-FAIR.</title>
        <p>[18] A. Miaschi, S. Davidson, D. Brunato, F. Dell’Or- Speech Tools for Italian. Final Workshop (EVALITA
letta, K. Sagae, C. H. Sanchez-Gutierrez, G. Venturi, 2023), CEUR.org, September 7th-8th 2023, Parma,
Tracking the evolution of written language compe- 2023.
tence in L2 Spanish learners, in: Proceedings of [28] X. Chen, D. Meurers, CTAP: A web-based tool
supBEA, ACL, Online, 2020, pp. 92–101. porting automatic complexity analysis, in:
Proceed[19] A. Miaschi, D. Brunato, F. Dell’Orletta, A nlp-based ings of the Workshop on Computational Linguistics
stylometric approach for tracking the evolution of for Linguistic Complexity (CL4LC), The COLING
l1 written language competence, Journal of Writing 2016 Organizing Committee, Osaka, Japan, 2016,
Research 13 (2021) 71–105. pp. 113–119.
[20] M. Lai, S. Menini, M. Polignano, V. Russo, R. Sprug- [29] C. D. Hromei, D. Croce, V. Basile, R. Basili,
Exnoli, G. Venturi, Evalita 2023: Overview of the 8th tremITA at EVALITA 2023: Multi-task sustainable
evaluation campaign of natural language process- scaling to large language models at its extreme, in:
ing and speech tools for italian, in: Proceedings M. Lai, S. Menini, M. Polignano, V. Russo, R.
Sprugof the Eighth Evaluation Campaign of Natural Lan- noli, G. Venturi (Eds.), Proceedings of the Eighth
guage Processing and Speech Tools for Italian. Final Evaluation Campaign of Natural Language
ProWorkshop (EVALITA 2023), CEUR.org, Parma, Italy, cessing and Speech Tools for Italian. Final
Work2023. shop (EVALITA 2023), CEUR.org, September
7th[21] A. Barbagli, Quanto e come si impara a scrivere nel 8th 2023, Parma, 2023.</p>
        <p>corso del primo biennio della scuola secondaria di [30] G. Sarti, M. Nissim, IT5: Large-scale text-to-text
primo grado, Nuova Cultura, 2016. pretraining for italian language understanding and
[22] S. Davidson, A. Yamada, P. Fernandez Mira, generation, ArXiv preprint 2203.03759 (2022). URL:
A. Carando, C. H. Sanchez Gutierrez, K. Sagae, De- https://arxiv.org/abs/2203.03759.
veloping NLP tools with a new corpus of learner [31] A. Santilli, Camoscio: An italian instruction-tuned
spanish, in: Proceedings of the 12th LRE Confer- llama, https://github.com/teelinsan/camoscio, 2023.
ence, ELRA, Marseille, France, 2020, pp. 7240–7245. [32] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A.
[23] H. Wu, N. Lin, S. Jiang, L. Xiao, BERT_4EVER Lachaux, T. Lacroix, B. Rozière, N. Goyal, E.
Hamat LangLearn: Language development assessment bro, F. Azhar, A. Rodriguez, A. Joulin, E. Grave,
model based on sequential information attention G. Lample, Llama: Open and eficient foundation
mechanism, in: M. Lai, S. Menini, M. Polignano, language models, 2023. arXiv:2302.13971.
V. Russo, R. Sprugnoli, G. Venturi (Eds.),
Proceedings of the Eighth Evaluation Campaign of
Natural Language Processing and Speech Tools for
Italian. Final Workshop (EVALITA 2023), CEUR.org,</p>
        <p>September 7th-8th 2023, Parma, 2023.
[24] T. Wolf, L. Debut, V. Sanh, alii, Transformers:
Stateof-the-art natural language processing, in: Proc. of</p>
        <p>EMNLP, ACL, Online, 2020, pp. 38–45.
[25] E. W. Stemle, M. Tebaldini, F. Bonanni, F. Pellegrino,</p>
        <p>P. Brasolin, G. H. Franzini, J.-C. Frey, O. Lopopolo,
S. Spina, bot.zen at LangLearn: regressing towards
interpretability, in: M. Lai, S. Menini, M. Polignano,
V. Russo, R. Sprugnoli, G. Venturi (Eds.),
Proceedings of the Eighth Evaluation Campaign of
Natural Language Processing and Speech Tools for
Italian. Final Workshop (EVALITA 2023), CEUR.org,</p>
        <p>September 7th-8th 2023, Parma, 2023.
[26] V. Santucci, F. Santarelli, L. Forti, S. Spina,
Automatic classification of text complexity, Applied</p>
        <p>Sciences 10 (2020) 7285.
[27] M. Barbini, E. Zanoli, C. Chesi, IUSS-Nets at
LangLearn: The role of morphosyntactic features in
language development assessment, in: M. Lai,
S. Menini, M. Polignano, V. Russo, R. Sprugnoli,
G. Venturi (Eds.), Proceedings of the Eighth
Evaluation Campaign of Natural Language Processing and</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>S.</given-names>
            <surname>Granger</surname>
          </string-name>
          ,
          <article-title>Error-tagged learner corpora and call: A promising synergy</article-title>
          ,
          <source>CALICO journal</source>
          (
          <year>2003</year>
          )
          <fpage>465</fpage>
          -
          <lpage>480</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>D. S.</given-names>
            <surname>McNamara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. A.</given-names>
            <surname>Crossley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. D.</given-names>
            <surname>Roscoe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. K.</given-names>
            <surname>Allen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Dai</surname>
          </string-name>
          ,
          <article-title>A hierarchical classification approach to automated essay scoring</article-title>
          ,
          <source>Assessing Writing</source>
          <volume>23</volume>
          (
          <year>2015</year>
          )
          <fpage>35</fpage>
          -
          <lpage>59</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>P.</given-names>
            <surname>Deane</surname>
          </string-name>
          , T. Quinlan,
          <article-title>What automated analyses of corpora can tell us about students' writing skills</article-title>
          ,
          <source>Journal of Writing Research</source>
          <volume>2</volume>
          (
          <year>2010</year>
          )
          <fpage>151</fpage>
          -
          <lpage>177</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>K.</given-names>
            <surname>Sagae</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lavie</surname>
          </string-name>
          ,
          <string-name>
<given-names>B.</given-names>
            <surname>MacWhinney</surname>
          </string-name>
          ,
          <article-title>Automatic measurement of syntactic development in child language</article-title>
          ,
          <source>in: Proceedings of the ACL, ACL</source>
          ,
          <year>2005</year>
          , pp.
          <fpage>197</fpage>
          -
          <lpage>204</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>X.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <article-title>Automatic measurement of syntactic complexity in child language acquisition</article-title>
          ,
          <source>International Journal of Corpus Linguistics</source>
          <volume>14</volume>
          (
          <year>2009</year>
          )
          <fpage>3</fpage>
          -
          <lpage>28</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>B.</given-names>
            <surname>Bram</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Housen</surname>
          </string-name>
          ,
          <article-title>Conceptualizing and measuring short-term changes in l2 writing complexity</article-title>
          ,
          <source>Journal of Second Language Writing</source>
          <volume>26</volume>
          (
          <year>2014</year>
          )
          <fpage>42</fpage>
          -
          <lpage>65</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>S. A.</given-names>
            <surname>Crossley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>McNamara</surname>
          </string-name>
          ,
          <article-title>Does writing development equal writing quality? A computational investigation of syntactic complexity in L2 learners</article-title>
          ,
          <source>Journal of Second Language Writing</source>
          <volume>26</volume>
          (
          <year>2014</year>
          )
          <fpage>66</fpage>
          -
          <lpage>79</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>S.</given-names>
            <surname>Lubetich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Sagae</surname>
          </string-name>
          ,
          <article-title>Data-driven measurement of child language development with simple syntactic templates</article-title>
          ,
          <source>in: Proceedings of COLING: Technical Papers</source>
          ,
          <year>2014</year>
          , pp.
          <fpage>2151</fpage>
          -
          <lpage>2160</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>S.</given-names>
            <surname>Crossley</surname>
          </string-name>
          ,
          <article-title>Linguistic features in writing quality and development: An overview</article-title>
          ,
          <source>Journal of Writing Research</source>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>S. A.</given-names>
            <surname>Crossley</surname>
          </string-name>
          ,
          <string-name>
<given-names>D. S.</given-names>
            <surname>McNamara</surname>
          </string-name>
          ,
          <article-title>Predicting second language writing proficiency: The roles of cohesion and linguistic sophistication</article-title>
          ,
          <source>Journal of Research in Reading 35</source>
          (
          <year>2012</year>
          )
          <fpage>115</fpage>
          -
          <lpage>135</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>S.</given-names>
            <surname>Vajjala</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Loo</surname>
          </string-name>
          ,
          <article-title>Automatic CEFR level prediction for Estonian learner text</article-title>
          ,
          <source>in: Proceedings of the third workshop on NLP for computer-assisted language learning</source>
          ,
          <year>2014</year>
          , pp.
          <fpage>113</fpage>
          -
          <lpage>127</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>E.</given-names>
            <surname>Volodina</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Pilán</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Alfter</surname>
          </string-name>
          , et al.,
<article-title>Classification of Swedish learner essays by CEFR levels</article-title>
          , in:
          <source>CALL communities and culture - short papers from EUROCALL</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>456</fpage>
          -
          <lpage>461</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>L.</given-names>
            <surname>Zilio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Wilkens</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Fairon</surname>
          </string-name>
          ,
          <article-title>An SLA corpus annotated with pedagogically relevant grammatical structures</article-title>
          ,
          <source>in: Proceedings of LREC, European Language Resources Association (ELRA)</source>
          , Miyazaki, Japan,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>K.</given-names>
            <surname>Sagae</surname>
          </string-name>
          ,
          <article-title>Tracking child language development with neural network language models</article-title>
          ,
          <source>Frontiers in Psychology</source>
          <volume>12</volume>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
[15]
          <string-name>
            <given-names>S. A.</given-names>
            <surname>Crossley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Weston</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>McLain Sullivan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. S.</given-names>
            <surname>McNamara</surname>
          </string-name>
          ,
          <article-title>The development of writing proficiency as a function of grade level: A linguistic analysis</article-title>
          ,
          <source>Written Communication</source>
          <volume>28</volume>
          (
          <year>2011</year>
          )
          <fpage>282</fpage>
          -
          <lpage>311</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Weiss</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Meurers</surname>
          </string-name>
          ,
          <article-title>Analyzing linguistic complexity and accuracy in academic language development of german across elementary and secondary school</article-title>
          ,
          <source>in: Proceedings of BEA</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>380</fpage>
          -
          <lpage>393</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>E.</given-names>
            <surname>Kerz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Qiao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Wiechmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ströbel</surname>
          </string-name>
          ,
          <article-title>Becoming linguistically mature: Modeling English and German children's writing development across school grades</article-title>
          ,
          <source>in: Proceedings of BEA, ACL, Online</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>65</fpage>
          -
          <lpage>74</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>[18] A. Miaschi, S. Davidson, D. Brunato, F. Dell'Orletta, K. Sagae, C. H. Sanchez-Gutierrez, G. Venturi, Tracking the evolution of written language competence in L2 Spanish learners, in: Proceedings of BEA, ACL, Online, 2020, pp. 92-101.</mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>[19] A. Miaschi, D. Brunato, F. Dell'Orletta, An NLP-based stylometric approach for tracking the evolution of L1 written language competence, Journal of Writing Research 13 (2021) 71-105.</mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>[20] M. Lai, S. Menini, M. Polignano, V. Russo, R. Sprugnoli, G. Venturi, EVALITA 2023: Overview of the 8th evaluation campaign of natural language processing and speech tools for Italian, in: Proceedings of the Eighth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2023), CEUR.org, Parma, Italy, 2023.</mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>[21] A. Barbagli, Quanto e come si impara a scrivere nel corso del primo biennio della scuola secondaria di primo grado, Nuova Cultura, 2016.</mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>[22] S. Davidson, A. Yamada, P. Fernandez Mira, A. Carando, C. H. Sanchez Gutierrez, K. Sagae, Developing NLP tools with a new corpus of learner Spanish, in: Proceedings of the 12th LREC Conference, ELRA, Marseille, France, 2020, pp. 7240-7245.</mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>[23] H. Wu, N. Lin, S. Jiang, L. Xiao, BERT_4EVER at LangLearn: Language development assessment model based on sequential information attention mechanism, in: M. Lai, S. Menini, M. Polignano, V. Russo, R. Sprugnoli, G. Venturi (Eds.), Proceedings of the Eighth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2023), CEUR.org, September 7th-8th 2023, Parma, 2023.</mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>[24] T. Wolf, L. Debut, V. Sanh, et al., Transformers: State-of-the-art natural language processing, in: Proceedings of EMNLP, ACL, Online, 2020, pp. 38-45.</mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>[25] E. W. Stemle, M. Tebaldini, F. Bonanni, F. Pellegrino, P. Brasolin, G. H. Franzini, J.-C. Frey, O. Lopopolo, S. Spina, bot.zen at LangLearn: Regressing towards interpretability, in: M. Lai, S. Menini, M. Polignano, V. Russo, R. Sprugnoli, G. Venturi (Eds.), Proceedings of the Eighth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2023), CEUR.org, September 7th-8th 2023, Parma, 2023.</mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>[26] V. Santucci, F. Santarelli, L. Forti, S. Spina, Automatic classification of text complexity, Applied Sciences 10 (2020) 7285.</mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>[27] M. Barbini, E. Zanoli, C. Chesi, IUSS-Nets at LangLearn: The role of morphosyntactic features in language development assessment, in: M. Lai, S. Menini, M. Polignano, V. Russo, R. Sprugnoli, G. Venturi (Eds.), Proceedings of the Eighth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2023), CEUR.org, September 7th-8th 2023, Parma, 2023.</mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>[28] X. Chen, D. Meurers, CTAP: A web-based tool supporting automatic complexity analysis, in: Proceedings of the Workshop on Computational Linguistics for Linguistic Complexity (CL4LC), The COLING 2016 Organizing Committee, Osaka, Japan, 2016, pp. 113-119.</mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>[29] C. D. Hromei, D. Croce, V. Basile, R. Basili, ExtremITA at EVALITA 2023: Multi-task sustainable scaling to large language models at its extreme, in: M. Lai, S. Menini, M. Polignano, V. Russo, R. Sprugnoli, G. Venturi (Eds.), Proceedings of the Eighth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2023), CEUR.org, September 7th-8th 2023, Parma, 2023.</mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>[30] G. Sarti, M. Nissim, IT5: Large-scale text-to-text pretraining for Italian language understanding and generation, ArXiv preprint 2203.03759 (2022). URL: https://arxiv.org/abs/2203.03759.</mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>[31] A. Santilli, Camoscio: An Italian instruction-tuned LLaMA, https://github.com/teelinsan/camoscio, 2023.</mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>[32] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, A. Rodriguez, A. Joulin, E. Grave, G. Lample, LLaMA: Open and efficient foundation language models, 2023. arXiv:2302.13971.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>