<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>bot.zen at LangLearn: regressing towards interpretability</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Egon W. Stemle</string-name>
          <email>egon.stemle@eurac.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Martina Tebaldini</string-name>
          <email>martina.tebaldini@student.unibz.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Francesca Bonanni</string-name>
          <email>francesca.bonanni@student.unibz.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Filippo Pellegrino</string-name>
          <email>filippo.pellegrino@student.unibz.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Paolo Brasolin</string-name>
          <email>paolo.brasolin@eurac.edu</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Greta H. Franzini</string-name>
          <email>greta.franzini@eurac.edu</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jennifer-Carmen Frey</string-name>
          <email>jennifercarmen.frey@eurac.edu</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Olga Lopopolo</string-name>
          <email>olga.lopopolo@eurac.edu</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Stefania Spina</string-name>
          <email>stefania.spina@eurac.edu</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Faculty of Informatics, Masaryk University</institution>
          ,
          <country country="CZ">Czech Republic</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Institute for Applied Linguistics, Eurac Research</institution>
          ,
          <addr-line>Viale Druso, 1, 39100 Bolzano (BZ)</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Università per Stranieri di Perugia</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>University of Bolzano</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff4">
          <label>4</label>
<institution>EVALITA 2023: Eighth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian</institution>
          ,
          <addr-line>Sep 7 - 8, Parma, IT</addr-line>
        </aff>
      </contrib-group>
      <abstract>
<p>This article describes the bot.zen system that participated in the Language Learning Development (LangLearn) shared task of the EVALITA 2023 campaign. We developed a simple machine learning system with good interpretability for later use, and used the shared task as an opportunity to provide Master's students with hands-on training and practical experience in NLP.</p>
      </abstract>
      <kwd-group>
        <kwd>system description</kwd>
        <kwd>langlearn</kwd>
        <kwd>evalita</kwd>
        <kwd>shared task</kwd>
        <kwd>regression</kwd>
        <kwd>MALT-IT2</kwd>
        <kwd>bot.zen</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>There has been an increasing interest in using Natural Language Processing (NLP) tools and machine learning techniques to analyse writing development in first (L1) and second language (L2) acquisition settings. The topic has been explored in Second Language Acquisition (SLA), Learner Corpus Research (LCR) (e.g. [1]), Corpus Linguistics, and in writing development research (e.g. [2]), and its goal is to understand how specific features can reflect writing quality and development.</p>
      <p>The analysis of language learner data typically spans linguistic data (information extracted from the text), textual metadata (information about the text), and sociolinguistic metadata (information about the author). According to [3], metadata such as reading time, geographic factors, and parents' occupation level can have an impact on language skill development, whereas [4] finds writing quality and development to be influenced by both text length and linguistic features, including lexical density, diversity and sophistication, as well as syntactic complexity and text cohesion. Finally, a text usually includes metadata such as the author, the date of creation, the context in which it was written, and a language proficiency rating. This contextual information enhances the overall understanding of the content. All of these research strands can support NLP applications for writing evaluation and assessment, including automatic essay scoring, automatic writing evaluation systems, and the automatic classification of text difficulty for learners. (For an in-depth overview and additional references, see [4].)</p>
      <p>At EVALITA 2023 [5], the Language Learning Development (LangLearn) shared task (ST) on automatic language development assessment [6] consisted in predicting the relative order of two essays written by the same student. More specifically, the texts provided were in Italian and Spanish, and came with only a very limited set of metadata. We participated in this ST to acquire experience with this type of data, and as an opportunity to involve and train Master's students from the University of Bolzano in NLP scientific work through practical experience.</p>
      <p>Our system relies only on the data provided for the ST, generates explicit information about students' progress out of implicit information in the data, and uses regression, without Large Language Models (LLMs) or Neural Networks (NNs), with features from an external tool specifically designed for Italian texts. As a result, our system performed well on Italian but poorly on Spanish data.</p>
      <p>The rest of the paper is organised as follows: Section 2 describes the system design and implementation; Section 3 describes our experiments and results; and Section 4 concludes with a short discussion.</p>
    </sec>
    <sec id="sec-3">
      <title>2. System Design and Implementation</title>
      <sec id="sec-3-1">
        <p>Our objective was to develop a simple machine learning system with good interpretability. Therefore, we prioritised a simple design that could provide transparent explanations for its decision-making process over a complex implementation and high predictive performance.</p>
        <sec id="sec-3-1-1">
          <title>2.1. Data Pre-processing</title>
          <p>In a first processing step, we restructure the given ST
data, which provides essay ids with their respective time
of writing in tabular format, as shown in Figure 1.</p>
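<p>The restructuring can be sketched as follows. The column names (student_id, essay_1, essay_2) are illustrative stand-ins for the actual ST fields, and we assume, for the sketch, that each student's entries chain consecutive essays:</p>

```python
import pandas as pd

# Illustrative ST-style input: each row states that essay_1 was written
# before essay_2 by the same student (column names are hypothetical).
pairs = pd.DataFrame({
    "student_id": ["s1", "s1", "s2"],
    "essay_1": ["a", "b", "x"],
    "essay_2": ["b", "c", "y"],
})

def essay_order(group: pd.DataFrame) -> list:
    """Chain overlapping (before, after) pairs into one total order."""
    successor = dict(zip(group["essay_1"], group["essay_2"]))
    # The earliest essay is the one that never appears as a successor.
    first = (set(successor) - set(successor.values())).pop()
    order = [first]
    while order[-1] in successor:
        order.append(successor[order[-1]])
    return order

orders = {sid: essay_order(g) for sid, g in pairs.groupby("student_id")}
print(orders)  # {'s1': ['a', 'b', 'c'], 's2': ['x', 'y']}
```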
        </sec>
        <sec id="sec-3-1-2">
          <title>2.2. Feature Extraction</title>
          <p>We use spaCy [7] and MALT-IT2 [8] in order to transform the raw input data into a meaningful set of informative features, as they provide easy-to-use and reliable feature extraction methods.</p>
          <sec id="sec-3-1-2-1">
            <title>2.2.1. spaCy</title>
            <p>spaCy is an open-source NLP library in Python providing tools for many tasks and pre-trained models for several languages, including Italian and Spanish (we use the it_core_news_lg and es_core_news_lg models).</p>
            <p>After tokenisation, we collect 1- to 3-grams of the word forms and of the part-of-speech tags. We additionally collect 2-grams of the morphological analyses of the words and 1-grams of each word's dependency relation. Overall, this amounts to roughly 17,000 features per document.</p>
          </sec>
        </sec>
      </sec>
      <sec id="sec-3-2">
        <title>2.2.2. MALT-IT2</title>
        <p>MALT-IT2 [8] is a resource to measure the difficulty of Italian texts in light of CEFR levels. Among the feature groups it provides, discursive features take into account the cohesive structure of a text.</p>
        <p>MALT-IT2 has to be invoked externally to process text files into a comma-separated values (CSV) file containing one line per document within its feature space; the CSV file is subsequently ingested by our system without any additional interaction with or knowledge of MALT-IT2. This means that we can swap out MALT-IT2 for a different system, or add another system capable of producing a document-feature matrix in CSV format.</p>
      </sec>
      <sec id="sec-3-3">
        <title>2.2.3. CTAP</title>
        <p>We also experimented with a version of the Common Text Analysis Platform (CTAP) [10] adapted for Italian text [11]. Much like MALT-IT2, CTAP is a linguistic complexity measurement tool offering various statistics and features to analyse text complexity in terms of length and lexical, syntactic and morpho-syntactic aspects. Unfortunately, we encountered some problems while processing the entire dataset: very short texts, for instance, caused CTAP to end prematurely with no error message, leaving us with no choice but to exclude CTAP features from our system. CTAP is capable of producing a document-feature matrix in CSV format and could otherwise have been easily integrated into our system.</p>
      </sec>
      <sec id="sec-3-3-1">
        <title>2.3. Processing Pipeline</title>
        <p>Our data processing pipeline has been implemented in Python and makes use of the pandas and scikit-learn libraries (we used Python 3.8.16, pandas 2.0.1, scikit-learn 1.2.2, and spacy 3.5.3 with it_core_news_lg 3.5.0 for processing).</p>
        <p>pandas [12] is an open-source library for data manipulation and analysis that integrates well with other libraries in the Python ecosystem, making it a versatile tool for data analysis and preparation. Our system uses pandas for internal data representations, manipulations and calculations during data pre-processing (Section 2.1) and for the processing of CSV files.</p>
        <p>scikit-learn [13] is an open-source machine learning library for Python providing a wide range of algorithms and tools for various tasks, including classification and dimensionality reduction. With a user-friendly and consistent interface, extensive documentation and an established user base, scikit-learn makes it easy to implement machine learning workflows. Our system uses scikit-learn for the main processing, as illustrated in Figure 3:</p>
        <preformat>pipeline = Pipeline(steps=[
    ('combined_features', combined_features),
    ('scaler', StandardScaler()),
    ('redux', TruncatedSVD(125)),
    ('estimator', HistGradientBoostingRegressor(loss='squared_error'))])</preformat>
        <p>The processing pipeline requires a document-feature matrix that represents all texts as vectors in our feature space. This space is the combination (concatenation) of the outputs of all different tools after feature extraction (Section 2.2), totalling around 17,200 features. To standardise the data, we use the StandardScaler(), which removes the mean and scales the data to unit variance. We also reduce the linear dimensions using the TruncatedSVD() method; we perform this feature reduction to remove noise and irrelevant information and to highlight important aspects of the data, enabling the model to make more accurate predictions. As a result, our processed dataset consists of 125 features. Finally, we use the HistGradientBoostingRegressor() for learning, an ensemble method (ensemble methods combine and aggregate the predictions of multiple models to improve predictive performance).</p>
        <p>In order to use our system for the ST, we perform data post-processing: we convert the output of our regression model for individual texts into a binary label for pairs of texts that indicates which of the two was written first.</p>
      </sec>
      <sec id="sec-3-4">
        <title>2.4. Optimisation</title>
        <p>The different parts of our system were optimised towards our target variable (absolute position) via an ad-hoc grid search in 3-fold cross-validation (CV) runs. The parts we optimised were: the types of spaCy information to collect (token.text, token.dep_, token.lemma_, token.pos_, token.morph); the n-gram ranges and minimum document frequencies for the spaCy collectors; the type of dimensionality reduction (PCA() or TruncatedSVD()) and the number of dimensions to use; and the regression algorithm to use (DecisionTreeRegressor(), SVR(), KernelRidge(), or HistGradientBoostingRegressor()).</p>
      </sec>
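<p>The grid search can be approximated with scikit-learn's GridSearchCV over the pipeline steps; the grid below is a reduced, illustrative subset of the search space described above, run on synthetic data:</p>

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA, TruncatedSVD
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X, y = rng.random((60, 40)), rng.random(60)  # synthetic stand-in data

pipe = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("redux", TruncatedSVD()),
    ("estimator", DecisionTreeRegressor(random_state=0)),
])

# Reduced grid: vary only the reduction type and dimensionality.
param_grid = {
    "redux": [PCA(), TruncatedSVD()],
    "redux__n_components": [5, 10],
}
search = GridSearchCV(pipe, param_grid, cv=3)  # 3-fold CV runs
search.fit(X, y)
print(search.best_params_["redux__n_components"] in (5, 10))  # True
```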
      <sec id="sec-3-6">
        <title>3. Experiments and Results</title>
        <sec id="sec-3-6-1">
          <title>3.1. Shared Task (ST)</title>
          <p>The Language Learning Development (LangLearn) ST [6] consisted in predicting the relative order of two essays: given a randomly ordered pair (Essay 1, Essay 2) written by the same student, the task was to predict whether Essay 1 had been written before Essay 2.</p>
        </sec>
      </sec>
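<p>Given the per-essay positions predicted by our regression model (Section 2.3), the pairwise prediction reduces to a comparison; the function name and the 1/0 encoding below are illustrative:</p>

```python
def pairwise_label(score_essay_1: float, score_essay_2: float) -> int:
    """Return 1 if Essay 1 is predicted to precede Essay 2, else 0.

    The scores are the regression model's predicted (absolute)
    positions; the 1/0 encoding is an illustrative choice.
    """
    return 1 if score_essay_1 < score_essay_2 else 0

print(pairwise_label(0.2, 1.7))  # 1: Essay 1 predicted earlier
print(pairwise_label(2.3, 0.9))  # 0: Essay 2 predicted earlier
```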
      <sec id="sec-3-7">
        <title>3.2. Shared Task Data</title>
        <p>The LangLearn ST data contains essays from two different corpora, namely CItA [14] and COWS-L2H [15], with texts in Italian and Spanish, respectively.</p>
        <p>The training data includes information on pairs of texts written by the same student at different times. Each entry represents the sequence of two essays, and by considering multiple entries with overlapping text ids we are able to recreate the sequence of all texts for each student (see Section 2.1). The data also contains the texts themselves but no additional (meta)information beyond this.</p>
      </sec>
      <sec id="sec-3-8">
        <title>CItA</title>
        <p>The CItA corpus (Corpus Italiano di Apprendenti L1) is a collection of Italian essays written by students learning their first language in seven different lower secondary schools in Rome over the course of two years (2012-2013 and 2013-2014). The students were asked to write different types of essays, namely reflexive, narrative, descriptive, expository and argumentative. The ST data contains 834 of the total 1,352 essays written but does not provide any information about the type of text.</p>
        <p>We also analysed the CItA part of the ST dataset independently of our system's performance. To this end, we used the original data, with texts in Set 1 always written before texts in Set 2. We then used CTAP to calculate feature values for all texts in both sets. Afterwards, we conducted a paired t-test to detect features that differed in their means (as a starting point for later research).</p>
        <p>We found some evidence that Set 1 had a higher number of 'basic vocabulary' words, whereas Set 2 had a higher number of imageability words. Set 1 also had higher TTR and HDD (Hypergeometric Distribution D) measures, but since Set 2 generally had longer texts, length effects certainly come into play [16]. Also, Set 1 used more auxiliary verbs, possibly due to a higher presence of past participle verbs. The use of connectives was higher in Set 2, especially for additive and consequence connectives. The number of dependent clauses per sentence did not differ significantly between the sets. Finally, Set 2 contained more sentences and more punctuation marks, but sentence length remained constant.</p>
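<p>The paired comparison of feature means can be sketched with SciPy; the feature values below are synthetic, with a built-in mean shift standing in for a feature that differs between the sets:</p>

```python
import numpy as np
from scipy import stats

# Synthetic CTAP-style values for the same essays at two points in time
# (Set 1 earlier, Set 2 later); real values would come from CTAP.
rng = np.random.default_rng(0)
set1 = rng.normal(loc=10.0, scale=2.0, size=30)
set2 = set1 + rng.normal(loc=1.0, scale=0.5, size=30)  # shifted mean

# Paired t-test: does this feature's mean differ between the sets?
t_stat, p_value = stats.ttest_rel(set1, set2)
print(p_value < 0.05)  # True: the built-in shift is detected
```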
      </sec>
    </sec>
    <sec id="sec-3-9">
      <title>COWS-L2H</title>
      <p>The COWS-L2H corpus (Corpus of Written Spanish of L2 and Heritage Speakers) is a collection of texts created by students of Spanish as a second language enrolled at a North American university. The students were asked to write multiple compositions at different times throughout the academic quarters, and the essays were collected over the course of two years, from 2017 to 2020. The essays were written by the same students, and the ST data contains 1,426 of the original 3,498 essays.</p>
    </sec>
    <sec id="sec-3-10">
      <title>3.3. Results</title>
      <p>The performance of our system on the two datasets (as reported by the ST organisers) was:</p>
      <table-wrap>
        <table>
          <thead>
            <tr><th /><th /><th>acc</th><th>f-score</th></tr>
          </thead>
          <tbody>
            <tr><td>CItA</td><td>bot.zen</td><td>0.83</td><td>0.84</td></tr>
            <tr><td /><td>best</td><td>0.93</td><td>0.93</td></tr>
            <tr><td /><td>baseline</td><td>0.55</td><td>0.55</td></tr>
            <tr><td>COWS-L2H</td><td>bot.zen</td><td>0.50</td><td>0.52</td></tr>
            <tr><td /><td>best</td><td>0.75</td><td>0.75</td></tr>
            <tr><td /><td>baseline</td><td>0.66</td><td>0.66</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <p>The baseline scores were calculated by training a LinearSVM using the number of tokens per document and the Type-Token Ratio (TTR) of the first 100 tokens in each document as input features.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Discussion</title>
      <p>Our system (see Section 2) was relatively simple: neither LLMs nor recurrent neural networks (RNNs) were integrated, nor did we use any data other than those provided by the organisers. While our results for the Italian data were satisfactory, we performed very poorly on the Spanish data, as expected: MALT-IT2, our main processing component, was designed for Italian texts only, which had a negative impact on our system when processing Spanish data, and despite the baseline system information also being encoded in our features, the presence of too much irrelevant data hampered the overall performance.</p>
      <p>Nevertheless, the ST served as a great opportunity for Master's students to gain practical project work experience: running into all-too-common data processing, encoding and decoding difficulties whilst navigating the intricacies of understanding, analysing and evaluating the data for the task at hand. With the help of the literature suggestions provided by the organisers, the students were able to develop relevant ideas and provide target-oriented answers to emerging questions. Although the internship was only 150 hours long and did not include the implementation of a functional application (Eurac Research took over the task of implementing a functional application), the students had the opportunity to familiarise themselves with the crucial stages of a scientific project, documenting all steps in a project report, which was partially incorporated in this paper.</p>
    </sec>
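<p>For illustration, the two baseline features described in Section 3.3 could be computed as below; whitespace tokenisation is a simplifying assumption, and a LinearSVM would then be trained on these values for each essay pair:</p>

```python
def baseline_features(text: str) -> list:
    """Token count plus the TTR of the first 100 tokens
    (whitespace tokenisation is a simplifying assumption)."""
    tokens = text.split()
    head = tokens[:100]
    ttr = len(set(head)) / len(head) if head else 0.0
    return [len(tokens), ttr]

print(baseline_features("the cat saw the dog"))  # [5, 0.8]
```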
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <p>We would like to thank our colleagues Arianna Bienati, Francesco Fernicola and Lionel Nicolas for their support during the project.</p>
    </sec>
    <sec id="sec-6">
      <title>References</title>
      <p>[1] S. A. Crossley, D. S. McNamara, Does writing development equal writing quality? A computational investigation of syntactic complexity in L2 learners, Journal of Second Language Writing 26 (2014) 66-79. doi:10.1016/j.jslw.2014.09.006.</p>
      <p>[2] P. Durrant, M. Brenchley, L. McCallum, Understanding Development and Proficiency in Writing: Quantitative Corpus Linguistic Approaches, 1st ed., Cambridge University Press, 2021. doi:10.1017/9781108770101.</p>
      <p>[3] A. Barbagli, F. Dell'Orletta, G. Venturi, P. Lucisano, S. Montemagni, Il ruolo delle tecnologie del linguaggio nel monitoraggio dell'evoluzione delle abilità di scrittura: primi risultati (2015) 105-123. URL: https://journals.openedition.org/ijcol/326.</p>
      <p>[4] S. A. Crossley, Linguistic features in writing quality and development: An overview, Journal of Writing Research 11 (2020) 415-443. doi:10.17239/jowr-2020.11.03.01.</p>
      <p>[5] M. Lai, S. Menini, M. Polignano, V. Russo, R. Sprugnoli, G. Venturi, EVALITA 2023: Overview of the 8th evaluation campaign of natural language processing and speech tools for Italian, in: Proceedings of the Eighth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2023), CEUR.org, Parma, Italy, 2023.</p>
      <p>[6] C. Alzetta, D. Brunato, F. Dell'Orletta, A. Miaschi, K. Sagae, C. H. Sánchez-Gutiérrez, G. Venturi, LangLearn at EVALITA 2023: Overview of the language learning development task, in: Proceedings of the Eighth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2023), CEUR.org, Parma, Italy, 2023.</p>
      <p>[7] M. Honnibal, I. Montani, spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing, 2017. URL: https://spacy.io/.</p>
      <p>[8] L. Forti, G. Grego Bolli, F. Santarelli, V. Santucci, S. Spina, MALT-IT2: A new resource to measure text difficulty in light of CEFR levels for Italian L2 learning, in: Proceedings of the Twelfth Language Resources and Evaluation Conference, European Language Resources Association, Marseille, France, 2020, pp. 7204-7211. URL: https://aclanthology.org/2020.lrec-1.890.</p>
      <p>[9] V. Santucci, F. Santarelli, L. Forti, S. Spina, Automatic classification of text complexity, Applied Sciences 10 (2020) 7285. doi:10.3390/app10207285.</p>
      <p>[10] X. Chen, D. Meurers, CTAP: A web-based tool supporting automatic complexity analysis, in: Proceedings of the Workshop on Computational Linguistics for Linguistic Complexity (CL4LC), The COLING 2016 Organizing Committee, Osaka, Japan, 2016, pp. 113-119. URL: https://aclanthology.org/W16-4113.</p>
      <p>[11] N. Okinina, J.-C. Frey, Z. Weiss, CTAP for Italian: Integrating components for the analysis of Italian into a multilingual linguistic complexity analysis tool, in: Proceedings of the Twelfth Language Resources and Evaluation Conference, European Language Resources Association, Marseille, France, 2020, pp. 7123-7131. URL: https://aclanthology.org/2020.lrec-1.880.</p>
      <p>[12] The pandas development team, pandas-dev/pandas: Pandas 2.0.1, Zenodo, 2020. URL: https://pandas.pydata.org/. doi:10.5281/zenodo.3509134.</p>
      <p>[13] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, E. Duchesnay, Scikit-learn: Machine learning in Python, Journal of Machine Learning Research 12 (2011) 2825-2830. URL: https://scikit-learn.org/.</p>
      <p>[14] A. Barbagli, P. Lucisano, F. Dell'Orletta, S. Montemagni, G. Venturi, CItA: an L1 Italian learners corpus to study the development of writing competence, in: Proceedings of the Tenth International Conference on Language Resources and Evaluation, European Language Resources Association, Portorož, Slovenia, 2016, pp. 88-95. URL: https://aclanthology.org/L16-1014.</p>
      <p>[15] A. Yamada, S. Davidson, P. Fernández-Mira, A. Carando, K. Sagae, C. Sánchez-Gutiérrez, COWS-L2H: A corpus of Spanish learner writing, Research in Corpus Linguistics 8 (2020) 17-32. doi:10.32714/ricl.08.01.02.</p>
      <p>[16] M. Stills, Language Sample Length Effects on Various Lexical Diversity Measures: An Analysis of Spanish Language Samples from Children, Technical Report, Portland State University, 2016. doi:10.15760/honors.250.</p>
    </sec>
    <sec id="sec-7">
      <title>A. Online Resources</title>
      <p>The bot.zen system for the EVALITA 2023 LangLearn shared task (on GitHub).</p>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>