<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>bot.zen at LangLearn: regressing towards interpretability</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Egon W. Stemle</string-name>
          <email>egon.stemle@eurac.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Martina Tebaldini</string-name>
          <email>martina.tebaldini@student.unibz.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Francesca Bonanni</string-name>
          <email>francesca.bonanni@student.unibz.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Filippo Pellegrino</string-name>
          <email>filippo.pellegrino@student.unibz.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Paolo Brasolin</string-name>
          <email>paolo.brasolin@eurac.edu</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Greta H. Franzini</string-name>
          <email>greta.franzini@eurac.edu</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jennifer-Carmen Frey</string-name>
          <email>jennifercarmen.frey@eurac.edu</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Olga Lopopolo</string-name>
          <email>olga.lopopolo@eurac.edu</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Stefania Spina</string-name>
          <email>stefania.spina@eurac.edu</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Faculty of Informatics, Masaryk University</institution>
          ,
          <country country="CZ">Czech Republic</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Institute for Applied Linguistics, Eurac Research</institution>
          ,
          <addr-line>Viale Druso, 1, 39100 Bolzano (BZ)</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Università per Stranieri di Perugia</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>University of Bolzano</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff4">
          <label>4</label>
<institution>EVALITA 2023: Eighth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian</institution>
          ,
          <addr-line>Sep 7 - 8, Parma, IT</addr-line>
        </aff>
      </contrib-group>
      <abstract>
<p>This article describes the bot.zen system that participated in the Language Learning Development (LangLearn) shared task of the EVALITA 2023 campaign. We developed a simple machine learning system with good interpretability for later use, and used the shared task as an opportunity to provide Master's students with hands-on training and practical experience in NLP.</p>
      </abstract>
      <kwd-group>
        <kwd>system description</kwd>
        <kwd>langlearn</kwd>
        <kwd>evalita</kwd>
        <kwd>shared task</kwd>
        <kwd>regression</kwd>
        <kwd>MALT-IT2</kwd>
        <kwd>bot.zen</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>There has been an increasing interest in using Natural Language Processing (NLP) tools and machine learning techniques to analyse writing development in first (L1) and second language (L2) acquisition settings. The topic has been explored in Second Language Acquisition (SLA), Learner Corpus Research (LCR) (e.g. [1]), Corpus Linguistics, and in writing development research (e.g. [2]), and its goal is to understand how specific features can reflect writing quality and development.</p>
      <p>The analysis of language learner data typically spans linguistic data (information extracted from the text), textual metadata (information about the text), and sociolinguistic metadata (information about the author). According to [3], metadata such as reading time, geographic factors, and parents' occupation level can have an impact on language skill development, whereas [4] finds writing quality and development to be influenced by both text length and linguistic features, including lexical density, diversity and sophistication, as well as syntactic complexity and text cohesion. Finally, a text usually includes metadata such as the author, the date of creation, the context in which it was written, and a language proficiency rating. This contextual information enhances the overall understanding of the content. All of these research strands can support NLP applications for writing evaluation and assessment, including automatic essay scoring, automatic writing evaluation systems, and the automatic classification of text difficulty for learners. (For an in-depth overview and additional references, see [4].)</p>
      <p>At EVALITA 2023 [5], the Language Learning Development (LangLearn) shared task (ST) on automatic language development assessment [6] consisted in predicting the relative order of two essays written by the same student. More specifically, the texts provided were in Italian and Spanish, and came with only a very limited set of metadata. We participated in this ST to acquire experience with this type of data, and as an opportunity to involve and train Master's students from the University of Bolzano in NLP scientific work through practical experience.</p>
      <p>Our system relies only on the data provided for the ST, generates explicit information about students' progress out of implicit information in the data, and uses regression, without Large Language Models (LLMs) or Neural Networks (NNs), with features from an external tool specifically designed for Italian texts. As a result, our system performed well on Italian but poorly on Spanish data.</p>
      <p>The rest of the paper is organised as follows: Section 2 describes the system design and implementation; Section 3 describes our experiments and results; and Section 4 concludes with a short discussion.</p>
    </sec>
    <sec id="sec-3">
      <title>2. System Design and Implementation</title>
      <sec id="sec-3-1">
        <p>Our objective was to develop a simple machine learning system with good interpretability. Therefore, we prioritised a simple design that could provide transparent explanations for its decision-making process over a complex implementation and high predictive performance.</p>
        <sec id="sec-3-1-1">
          <title>2.1. Data Pre-processing</title>
          <p>In a first processing step, we restructure the given ST
data, which provides essay ids with their respective time
of writing in tabular format, as shown in Figure 1.</p>
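<p>The restructuring can be sketched as follows. The column names (student_id, essay_1, essay_2) are illustrative stand-ins for the actual ST fields, and we assume, for the sketch, that each student's entries chain consecutive essays:</p>

```python
import pandas as pd

# Illustrative ST-style input: each row states that essay_1 was written
# before essay_2 by the same student (column names are hypothetical).
pairs = pd.DataFrame({
    "student_id": ["s1", "s1", "s2"],
    "essay_1": ["a", "b", "x"],
    "essay_2": ["b", "c", "y"],
})

def essay_order(group: pd.DataFrame) -> list:
    """Chain overlapping (before, after) pairs into one total order."""
    successor = dict(zip(group["essay_1"], group["essay_2"]))
    # The earliest essay is the one that never appears as a successor.
    first = (set(successor) - set(successor.values())).pop()
    order = [first]
    while order[-1] in successor:
        order.append(successor[order[-1]])
    return order

orders = {sid: essay_order(g) for sid, g in pairs.groupby("student_id")}
print(orders)  # {'s1': ['a', 'b', 'c'], 's2': ['x', 'y']}
```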
        </sec>
        <sec id="sec-3-1-2">
          <title>2.2. Feature Extraction</title>
          <p>We use spaCy [7] and MALT-IT2 [8] in order to transform the raw input data into a meaningful set of informative features, as they provide easy-to-use and reliable feature extraction methods.</p>
          <sec id="sec-3-1-2-1">
            <title>2.2.1. spaCy</title>
            <p>spaCy is an open-source NLP library in Python providing tools for many tasks and pre-trained models for several languages, including Italian and Spanish (we use the it_core_news_lg and es_core_news_lg models).</p>
            <p>After tokenisation, we collect 1- to 3-grams of the word forms and of the part-of-speech tags. We additionally collect 2-grams of the morphological analyses of the words and 1-grams of each word's dependency relation. Overall, this amounts to roughly 17,000 features per document.</p>
          </sec>
        </sec>
      </sec>
      <sec id="sec-3-2">
        <title>2.2.2. MALT-IT2</title>
        <p>MALT-IT2 [8] is a resource to measure the difficulty of Italian texts in light of CEFR levels. Among the feature groups it provides, discursive features take into account the cohesive structure of a text.</p>
        <p>MALT-IT2 has to be invoked externally to process text files into a comma-separated values (CSV) file containing one line per document within its feature space; the CSV file is subsequently ingested by our system without any additional interaction with or knowledge of MALT-IT2. This means that we can swap out MALT-IT2 for a different system, or add another system capable of producing a document-feature matrix in CSV format.</p>
      </sec>
      <sec id="sec-3-3">
        <title>2.2.3. CTAP</title>
        <p>We also experimented with a version of the Common Text Analysis Platform (CTAP) [10] adapted for Italian text [11]. Much like MALT-IT2, CTAP is a linguistic complexity measurement tool offering various statistics and features to analyse text complexity in terms of length and lexical, syntactic and morpho-syntactic aspects. Unfortunately, we encountered some problems while processing the entire dataset: very short texts, for instance, caused CTAP to end prematurely with no error message, leaving us with no choice but to exclude CTAP features from our system. CTAP is capable of producing a document-feature matrix in CSV format and could otherwise have been easily integrated into our system.</p>
      </sec>
      <sec id="sec-3-3-1">
        <title>2.3. Processing Pipeline</title>
        <p>Our data processing pipeline has been implemented in Python and makes use of the pandas and scikit-learn libraries (we used Python 3.8.16, pandas 2.0.1, scikit-learn 1.2.2, and spacy 3.5.3 with it_core_news_lg 3.5.0 for processing).</p>
        <p>pandas [12] is an open-source library for data manipulation and analysis that integrates well with other libraries in the Python ecosystem, making it a versatile tool for data analysis and preparation. Our system uses pandas for internal data representations, manipulations and calculations during data pre-processing (Section 2.1) and for the processing of CSV files.</p>
        <p>scikit-learn [13] is an open-source machine learning library for Python providing a wide range of algorithms and tools for various tasks, including classification and dimensionality reduction. With a user-friendly and consistent interface, extensive documentation and an established user base, scikit-learn makes it easy to implement machine learning workflows. Our system uses scikit-learn for the main processing, as illustrated in Figure 3:</p>
        <preformat>pipeline = Pipeline(steps=[
    ('combined_features', combined_features),
    ('scaler', StandardScaler()),
    ('redux', TruncatedSVD(125)),
    ('estimator', HistGradientBoostingRegressor(loss='squared_error'))])</preformat>
        <p>The processing pipeline requires a document-feature matrix that represents all texts as vectors in our feature space. This space is the combination (concatenation) of the outputs of all different tools after feature extraction (Section 2.2), totalling around 17,200 features. To standardise the data, we use the StandardScaler(), which removes the mean and scales the data to unit variance. We also reduce the linear dimensions using the TruncatedSVD() method; we perform this feature reduction to remove noise and irrelevant information and to highlight important aspects of the data, enabling the model to make more accurate predictions. As a result, our processed dataset consists of 125 features. Finally, we use the HistGradientBoostingRegressor() for learning, an ensemble method (ensemble methods combine and aggregate the predictions of multiple models to improve predictive performance).</p>
        <p>In order to use our system for the ST, we perform data post-processing: we convert the output of our regression model for individual texts into a binary label for pairs of texts that indicates which of the two was written first.</p>
      </sec>
      <sec id="sec-3-4">
        <title>2.4. Optimisation</title>
        <p>The different parts of our system were optimised towards our target variable (absolute position) via an ad-hoc grid search in 3-fold cross-validation (CV) runs. The parts we optimised were: the types of spaCy information to collect (token.text, token.dep_, token.lemma_, token.pos_, token.morph); the n-gram ranges and minimum document frequencies for the spaCy collectors; the type of dimensionality reduction (PCA() or TruncatedSVD()) and the number of dimensions to use; and the regression algorithm to use (DecisionTreeRegressor(), SVR(), KernelRidge(), or HistGradientBoostingRegressor()).</p>
      </sec>
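<p>The grid search can be approximated with scikit-learn's GridSearchCV over the pipeline steps; the grid below is a reduced, illustrative subset of the search space described above, run on synthetic data:</p>

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA, TruncatedSVD
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X, y = rng.random((60, 40)), rng.random(60)  # synthetic stand-in data

pipe = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("redux", TruncatedSVD()),
    ("estimator", DecisionTreeRegressor(random_state=0)),
])

# Reduced grid: vary only the reduction type and dimensionality.
param_grid = {
    "redux": [PCA(), TruncatedSVD()],
    "redux__n_components": [5, 10],
}
search = GridSearchCV(pipe, param_grid, cv=3)  # 3-fold CV runs
search.fit(X, y)
print(search.best_params_["redux__n_components"] in (5, 10))  # True
```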
      <sec id="sec-3-6">
        <title>3. Experiments and Results</title>
        <sec id="sec-3-6-1">
          <title>3.1. Shared Task (ST)</title>
          <p>The Language Learning Development (LangLearn) ST [6] consisted in predicting the relative order of two essays: given a randomly ordered pair (Essay 1, Essay 2) written by the same student, the task was to predict whether Essay 1 had been written before Essay 2.</p>
        </sec>
      </sec>
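<p>Given the per-essay positions predicted by our regression model (Section 2.3), the pairwise prediction reduces to a comparison; the function name and the 1/0 encoding below are illustrative:</p>

```python
def pairwise_label(score_essay_1: float, score_essay_2: float) -> int:
    """Return 1 if Essay 1 is predicted to precede Essay 2, else 0.

    The scores are the regression model's predicted (absolute)
    positions; the 1/0 encoding is an illustrative choice.
    """
    return 1 if score_essay_1 < score_essay_2 else 0

print(pairwise_label(0.2, 1.7))  # 1: Essay 1 predicted earlier
print(pairwise_label(2.3, 0.9))  # 0: Essay 2 predicted earlier
```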
      <sec id="sec-3-7">
        <title>3.2. Shared Task Data</title>
        <p>The LangLearn ST data contains essays from two different corpora, namely CItA [14] and COWS-L2H [15], with texts in Italian and Spanish, respectively.</p>
        <p>The training data includes information on pairs of texts written by the same student at different times. Each entry represents the sequence of two essays, and by considering multiple entries with overlapping text ids we are able to recreate the sequence of all texts for each student (see Section 2.1). The data also contains the texts themselves but no additional (meta)information beyond this.</p>
      </sec>
      <sec id="sec-3-8">
        <title>CItA</title>
        <p>The CItA corpus (Corpus Italiano di Apprendenti L1) is a collection of Italian essays written by students learning their first language in seven different lower secondary schools in Rome over the course of two years (2012-2013 and 2013-2014). The students were asked to write different types of essays, namely reflexive, narrative, descriptive, expository and argumentative. The ST data contains 834 of the total 1,352 essays written but does not provide any information about the type of text.</p>
        <p>We also analysed the CItA part of the ST dataset independently of our system's performance. To this end, we used the original data, with texts in Set 1 always written before texts in Set 2. We then used CTAP to calculate feature values for all texts in both sets. Afterwards, we conducted a paired t-test to detect features that differed in their means (as a starting point for later research).</p>
        <p>We found some evidence that Set 1 had a higher number of 'basic vocabulary' words, whereas Set 2 had a higher number of imageability words. Set 1 also had higher TTR and HDD (Hypergeometric Distribution D) measures, but since Set 2 generally had longer texts, length effects certainly come into play [16]. Also, Set 1 used more auxiliary verbs, possibly due to a higher presence of past participle verbs. The use of connectives was higher in Set 2, especially for additive and consequence connectives. The number of dependent clauses per sentence did not differ significantly between the sets. Finally, Set 2 contained more sentences and more punctuation marks, but sentence length remained constant.</p>
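<p>The paired comparison of feature means can be sketched with SciPy; the feature values below are synthetic, with a built-in mean shift standing in for a feature that differs between the sets:</p>

```python
import numpy as np
from scipy import stats

# Synthetic CTAP-style values for the same essays at two points in time
# (Set 1 earlier, Set 2 later); real values would come from CTAP.
rng = np.random.default_rng(0)
set1 = rng.normal(loc=10.0, scale=2.0, size=30)
set2 = set1 + rng.normal(loc=1.0, scale=0.5, size=30)  # shifted mean

# Paired t-test: does this feature's mean differ between the sets?
t_stat, p_value = stats.ttest_rel(set1, set2)
print(p_value < 0.05)  # True: the built-in shift is detected
```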
      </sec>
    </sec>
    <sec id="sec-3-9">
      <title>COWS-L2H</title>
      <p>The COWS-L2H corpus (Corpus of Written Spanish of L2 and Heritage Speakers) is a collection of texts created by students of Spanish as a second language enrolled at a North American university. The students were asked to write multiple compositions at different times throughout the academic quarters, and the essays were collected over the course of two years, from 2017 to 2020. The essays were written by the same students, and the ST data contains 1,426 of the original 3,498 essays.</p>
    </sec>
    <sec id="sec-3-10">
      <title>3.3. Results</title>
      <p>The performance of our system on the two datasets (as reported by the ST organisers) was:</p>
      <table-wrap>
        <table>
          <thead>
            <tr><th /><th /><th>acc</th><th>f-score</th></tr>
          </thead>
          <tbody>
            <tr><td>CItA</td><td>bot.zen</td><td>0.83</td><td>0.84</td></tr>
            <tr><td /><td>best</td><td>0.93</td><td>0.93</td></tr>
            <tr><td /><td>baseline</td><td>0.55</td><td>0.55</td></tr>
            <tr><td>COWS-L2H</td><td>bot.zen</td><td>0.50</td><td>0.52</td></tr>
            <tr><td /><td>best</td><td>0.75</td><td>0.75</td></tr>
            <tr><td /><td>baseline</td><td>0.66</td><td>0.66</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <p>The baseline scores were calculated by training a LinearSVM using the number of tokens per document and the Type-Token Ratio (TTR) of the first 100 tokens in each document as input features.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Discussion</title>
      <p>Our system (see Section 2) was relatively simple: neither LLMs nor recurrent neural networks (RNNs) were integrated, nor did we use any data other than those provided by the organisers. While our results for the Italian data were satisfactory, we performed very poorly on the Spanish data, as expected: MALT-IT2, our main processing component, was designed for Italian texts only, which had a negative impact on our system when processing Spanish data, and despite the baseline system information also being encoded in our features, the presence of too much irrelevant data hampered the overall performance.</p>
      <p>Nevertheless, the ST served as a great opportunity for Master's students to gain practical project work experience: running into all-too-common data processing, encoding and decoding difficulties whilst navigating the intricacies of understanding, analysing and evaluating the data for the task at hand. With the help of the literature suggestions provided by the organisers, the students were able to develop relevant ideas and provide target-oriented answers to emerging questions. Although the internship was only 150 hours long and did not include the implementation of a functional application (Eurac Research took over the task of implementing a functional application), the students had the opportunity to familiarise themselves with the crucial stages of a scientific project, documenting all steps in a project report, which was partially incorporated in this paper.</p>
    </sec>
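<p>For illustration, the two baseline features described in Section 3.3 could be computed as below; whitespace tokenisation is a simplifying assumption, and a LinearSVM would then be trained on these values for each essay pair:</p>

```python
def baseline_features(text: str) -> list:
    """Token count plus the TTR of the first 100 tokens
    (whitespace tokenisation is a simplifying assumption)."""
    tokens = text.split()
    head = tokens[:100]
    ttr = len(set(head)) / len(head) if head else 0.0
    return [len(tokens), ttr]

print(baseline_features("the cat saw the dog"))  # [5, 0.8]
```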
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <p>We would like to thank our colleagues Arianna Bienati, Francesco Fernicola and Lionel Nicolas for their support during the project.</p>
    </sec>
    <sec id="sec-6">
      <title>References</title>
      <p>[1] S. A. Crossley, D. S. McNamara, Does writing development equal writing quality? A computational investigation of syntactic complexity in L2 learners, Journal of Second Language Writing 26 (2014) 66-79. doi:10.1016/j.jslw.2014.09.006.</p>
      <p>[2] P. Durrant, M. Brenchley, L. McCallum, Understanding Development and Proficiency in Writing: Quantitative Corpus Linguistic Approaches, 1st ed., Cambridge University Press, 2021. doi:10.1017/9781108770101.</p>
      <p>[3] A. Barbagli, F. Dell'Orletta, G. Venturi, P. Lucisano, S. Montemagni, Il ruolo delle tecnologie del linguaggio nel monitoraggio dell'evoluzione delle abilità di scrittura: primi risultati (2015) 105-123. URL: https://journals.openedition.org/ijcol/326.</p>
      <p>[4] S. A. Crossley, Linguistic features in writing quality and development: An overview, Journal of Writing Research 11 (2020) 415-443. doi:10.17239/jowr-2020.11.03.01.</p>
      <p>[5] M. Lai, S. Menini, M. Polignano, V. Russo, R. Sprugnoli, G. Venturi, EVALITA 2023: Overview of the 8th evaluation campaign of natural language processing and speech tools for Italian, in: Proceedings of the Eighth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2023), CEUR.org, Parma, Italy, 2023.</p>
      <p>[6] C. Alzetta, D. Brunato, F. Dell'Orletta, A. Miaschi, K. Sagae, C. H. Sánchez-Gutiérrez, G. Venturi, LangLearn at EVALITA 2023: Overview of the language learning development task, in: Proceedings of the Eighth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2023), CEUR.org, Parma, Italy, 2023.</p>
      <p>[7] M. Honnibal, I. Montani, spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing, 2017. URL: https://spacy.io/.</p>
      <p>[8] L. Forti, G. Grego Bolli, F. Santarelli, V. Santucci, S. Spina, MALT-IT2: A new resource to measure text difficulty in light of CEFR levels for Italian L2 learning, in: Proceedings of the Twelfth Language Resources and Evaluation Conference, European Language Resources Association, Marseille, France, 2020, pp. 7204-7211. URL: https://aclanthology.org/2020.lrec-1.890.</p>
      <p>[9] V. Santucci, F. Santarelli, L. Forti, S. Spina, Automatic classification of text complexity, Applied Sciences 10 (2020) 7285. doi:10.3390/app10207285.</p>
      <p>[10] X. Chen, D. Meurers, CTAP: A web-based tool supporting automatic complexity analysis, in: Proceedings of the Workshop on Computational Linguistics for Linguistic Complexity (CL4LC), The COLING 2016 Organizing Committee, Osaka, Japan, 2016, pp. 113-119. URL: https://aclanthology.org/W16-4113.</p>
      <p>[11] N. Okinina, J.-C. Frey, Z. Weiss, CTAP for Italian: Integrating components for the analysis of Italian into a multilingual linguistic complexity analysis tool, in: Proceedings of the Twelfth Language Resources and Evaluation Conference, European Language Resources Association, Marseille, France, 2020, pp. 7123-7131. URL: https://aclanthology.org/2020.lrec-1.880.</p>
      <p>[12] The pandas development team, pandas-dev/pandas: Pandas 2.0.1, Zenodo, 2020. URL: https://pandas.pydata.org/. doi:10.5281/zenodo.3509134.</p>
      <p>[13] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, E. Duchesnay, Scikit-learn: Machine learning in Python, Journal of Machine Learning Research 12 (2011) 2825-2830. URL: https://scikit-learn.org/.</p>
      <p>[14] A. Barbagli, P. Lucisano, F. Dell'Orletta, S. Montemagni, G. Venturi, CItA: an L1 Italian learners corpus to study the development of writing competence, in: Proceedings of the Tenth International Conference on Language Resources and Evaluation, European Language Resources Association, Portorož, Slovenia, 2016, pp. 88-95. URL: https://aclanthology.org/L16-1014.</p>
      <p>[15] A. Yamada, S. Davidson, P. Fernández-Mira, A. Carando, K. Sagae, C. Sánchez-Gutiérrez, COWS-L2H: A corpus of Spanish learner writing, Research in Corpus Linguistics 8 (2020) 17-32. doi:10.32714/ricl.08.01.02.</p>
      <p>[16] M. Stills, Language Sample Length Effects on Various Lexical Diversity Measures: An Analysis of Spanish Language Samples from Children, Technical Report, Portland State University, 2016. doi:10.15760/honors.250.</p>
    </sec>
    <sec id="sec-7">
      <title>A. Online Resources</title>
      <p>The bot.zen system for the EVALITA 2023 LangLearn shared task (on GitHub).</p>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>