Quale testo è scritto meglio? A Study on Italian Native Speakers' Perception of Writing Quality

Aldo Cerulli•, Dominique Brunato⋄, Felice Dell'Orletta⋄
• University of Pisa, a.cerulli1@studenti.unipi.it
⋄ Istituto di Linguistica Computazionale "Antonio Zampolli" (ILC–CNR), ItaliaNLP Lab - www.italianlp.it, {dominique.brunato, felice.dellorletta}@ilc.cnr.it

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract

This paper presents a pilot study focused on Italian native speakers' perception of writing quality. A group of native speakers expressed their preferences on 100 pairs of essays extracted from an Italian corpus of compositions written by L1 students of lower secondary school. By analysing their answers, it was possible to identify a set of linguistic features characterizing essays perceived as well written and to assess the impact of students' errors on the perception of text quality. The paper describes the crowdsourcing technique used to collect the data as well as the linguistic analysis and its results.

1 Introduction

The institution of distance learning paradigms, which has become crucial during the Covid-19 pandemic, showed the need to provide schools and universities with Natural Language Processing (NLP)-based tools to assist students, teachers and professors. Nowadays, language technologies are more and more exploited to develop educational applications, such as Intelligent Computer-Assisted Language Learning (ICALL) systems (Granger, 2003) and tools for automated essay scoring (Attali and Burstein, 2006) or automatic error detection and correction (Ng et al., 2013). A fundamental requirement for developing applications of this kind is the availability of electronically accessible corpora of learners' productions. The corpora created so far differ in many respects. For instance, considering the types of learners examined, they can gather productions written by L2 students or by native speakers: the former have been built for many languages (e.g. English, Arabic, German, Hungarian, Basque, Czech, Italian), while the latter are mainly available for English. In both cases, a peculiarity of existing corpora is that they are cross-sectional rather than longitudinal. A notable exception in the context of Italian as L1, which is the focus of our contribution, is represented by CItA (Corpus Italiano di Apprendenti L1), which was jointly developed by the Institute for Computational Linguistics of the Italian National Research Council (CNR) of Pisa and the Department of Social and Developmental Psychology at Sapienza University of Rome (Barbagli et al., 2016): it is the first digitalized collection of essays written by the same group of Italian L1 learners in the first two years of lower secondary school.¹

¹ The corpus is freely available for research goals at http://www.italianlp.it/resources/cita-corpus-italiano-di-apprendenti-l1/

The diachronic and longitudinal nature of CItA makes it particularly suitable for studying the evolution of L1 writing competence over the two years, assuming that many remarkable changes in writing skills occur in this period. For instance, in their recent work, Miaschi et al. (2021) showed that it is possible to automatically learn the writing development curve of students: they extracted a wide set of linguistic features from the essays and used them to train a binary classification algorithm able to predict the chronological order of two productions written by the same pupil at different times.

The present study ranks among the research based on CItA, but takes a different approach from the one just mentioned: instead of tracking the development of students' writing competence, we focused on the perception of writing quality by Italian L1 speakers, with the aim of understanding whether it is possible to find the linguistic features that are crucially involved in the distinction between 'better' and 'worse' essays according to our target reader.

Contributions. To the best of our knowledge, this is the first paper that (i) introduces a dataset of essays evaluated in terms of perceived writing quality by means of a crowdsourcing task, (ii) deals with the correlation between linguistic features and perceived quality of writing and (iii) assesses the impact of students' errors on quality perception.

2 Corpus Collection

As previously mentioned, the starting point of our study was the CItA corpus. It comprises 1,352 essays, written by 156 pupils of seven lower secondary schools in Rome (three in the historical center and four in the suburbs) during the school years 2012-2013 and 2013-2014. The productions respond to 124 writing prompts that pertain to five textual typologies: reflexive, narrative, descriptive, expository and argumentative. An additional 'common prompt' was presented at the end of each school year, in which students were asked to write a letter advising a younger friend on how to compose better essays. The common prompts were aimed at understanding how learners internalize the different writing instructions given by teachers.

Each essay contained in CItA is also accompanied by a set of metadata tracking students' biographical, sociocultural and sociolinguistic information. Beyond its longitudinal nature, the most significant novelty introduced by CItA regards error annotation, which was manually performed by a middle school teacher according to a new three-level schema including: the macro-class of error (i.e. grammatical, orthographic and lexical); the class of error (e.g. verbs, prepositions, monosyllables); and the corresponding type of modification required to correct it. More details about the CItA collection are reported in Barbagli et al. (2016).

2.1 Essay Selection

For the purpose of our investigation, we selected 200 essays from CItA to be submitted to human evaluation. The essays ranged from a minimum of 141 tokens to a maximum of 1153 tokens and their average length was 359.4 tokens. Then, to gather judgments on writing quality, we created ten questionnaires, each one consisting of ten pairs of essays of the same grade, and distributed them to native speakers of all ages and cultural backgrounds.

Table 1 reports the criteria we adopted to select the pairs of essays. As can be seen, Survey 1 allows the comparison between essays responding to the common prompts written by students attending the first or the second grade. In surveys 2-8, we chose essays pertaining to the same textual typology, assuming that their similarity with regard to content would let the annotators focus on stylistic issues to orient their judgment, and paired them according to the school year in which they were written. Essays in questionnaires 9 and 10 were instead paired according to their number of errors: for each year, we divided the range between the minimum number of errors (0) and the maximum one (49 for the first year, 43 for the second one) into ten error bins and designed the two surveys by choosing a pair of productions for each bin. Surveys comparing essays with a similar number of errors were meant to reveal which categories of errors have a greater impact on human judgment.

| Survey | Selection criteria | Number of pairs, I year | Number of pairs, II year |
|---|---|---|---|
| 1 | Common prompts | 5 | 5 |
| 2 | Narrative | 10 | 0 |
| 3 | Narrative | 0 | 10 |
| 4 | Reflexive | 10 | 0 |
| 5 | Reflexive | 0 | 10 |
| 6 | Descriptive | 8 | 2 |
| 7 | Expository | 3 | 7 |
| 8 | Argumentative | 3 | 7 |
| 9 | Error bins | 10 | 0 |
| 10 | Error bins | 0 | 10 |

Table 1: Criteria used for pairing the essays and number of pairs for each survey.

2.2 Human Evaluation

After designing the surveys, we moved on to their implementation using the QuestBase platform.² We defined a three-section structure including the filling-in instructions, the personal data entry form and the essay evaluation pages.

² https://story.questbase.com/

Filling-in instructions. The first section reported the following submission guidelines:

Ciao!
Il presente sondaggio è rivolto a partecipanti di madrelingua italiana. La sua compilazione richiede circa 20 minuti. Prima di proseguire, dando il consenso alla partecipazione, ti spieghiamo in cosa consiste.
Nelle pagine che seguono leggerai dieci coppie di temi scritti da studenti del primo e del secondo anno di scuola media. I testi possono contenere un certo numero di errori. Per ciascuna coppia ti chiediamo di indicare quale dei due temi ritieni sia scritto meglio.
Non esistono risposte giuste o sbagliate: conta semplicemente quello che pensi! Tieni presente che i temi di una stessa coppia possono trattare argomenti diversi, ma questo non deve influire sul tuo giudizio.
La tua partecipazione al sondaggio è completamente libera. Se in qualsiasi momento dovessi cambiare idea e volessi interrompere il test, potrai farlo liberamente.
Un'ultima cosa: prima di iniziare il sondaggio, ti chiediamo di darci alcune tue informazioni anagrafiche, che serviranno solo a fini statistici. I dati rimarranno completamente anonimi e in nessun modo le risposte verranno associate alla tua persona.
Se hai dubbi, curiosità o proposte di miglioramento, scrivimi all'indirizzo: a.cerulli1@studenti.unipi.it.
Buona lettura!

For the sake of completeness, we also report an English translation of the same guidelines:

Hello!
This survey is addressed to Italian native speakers. Completing it requires about 20 minutes. By completing it, you give your consent to participation. Before going on, we explain what it consists of.
In the following pages you will read ten pairs of essays written by Italian L1 learners during the first two years of lower secondary school. The essays may contain linguistic errors. For each pair, you are asked to choose the better written of the two essays.
No answers are right or wrong: you only have to express your opinion! Bear in mind that the essays of a pair can concern different topics, but this must not affect your judgment.
Your participation in the survey is completely free. You may withdraw from it at any time.
Before starting the survey, we ask you to provide some personal information that will be used for statistical purposes. Data will remain completely anonymous and will not be connected to you in any way.
If you have doubts, curiosities or proposals for improvement, please write to me at: a.cerulli1@studenti.unipi.it.
Have a good read!

Personal data entry form. The surveys were anonymous. However, as mentioned above, we asked the annotators to enter some personal information (age, sex, education) for statistical purposes.

Essay evaluation. The third section comprised ten pages, each occupied by two side-by-side essays and a field for the answer (Figure 1). The user had to choose the label '1' if they preferred the first essay, '2' otherwise.

Figure 1: Comparison of a pair of essays extracted from one of the ten surveys.

After carrying out a pilot study to test the adequacy of the structure as well as the completeness and clarity of the instructions, we started collecting evaluations. Using Linktree³ we added the links to the ten questionnaires to a single web page and shared its link through WhatsApp, Facebook and Instagram: by clicking on it, users were redirected to the page and could access every survey.

³ https://linktr.ee/

3 Analysis of Human Judgments

We collected 223 annotations distributed quite homogeneously among the ten surveys, except for the first one, which was submitted 28 times. It is worth focusing on the heterogeneous composition of the sample of readers. Concerning sex, the large majority of answers (183 units, equal to 82.1%) were given by women, against 38 (17%) given by men; just two people preferred not to specify their gender.

Regarding age, we divided the group into six bins (Figure 2). The most frequent class (97 units) was '20-24 years', followed by '25-29 years' (64 units). This means that most readers (72.5%) ranged from 20 to 29 years of age. 35 evaluations (15.8%) were made by native speakers between 30 and 39 years of age. People belonging to the remaining bins contributed to the task for an overall 11.7%.

Figure 2: Distribution of annotations with respect to readers' age bins.

Figure 3: Distribution of annotations with respect to readers' education.

Finally, Figure 3 shows the distribution of submissions with respect to readers' education: 91.9% of annotations were given by people holding an academic degree (118 units, equal to 53.2%) or a high school diploma (86 units, equal to 38.7%). 12 annotators (5.4%) had a middle school certificate; 4 (1.8%) held a doctoral degree; the last two indicated a non-specific 'Other'.

3.1 Inter-Annotator Agreement

At this point, we defined a selection function to discard inaccurate annotations and obtain the same number of coherent annotations for each survey. We first built the average vector of every survey as the set of ten values '1' or '2', chosen according to the label most often assigned to each pair of essays; then, we calculated the distance between each survey's average vector and all its annotations. We implemented the Euclidean metric generalized to the n-dimensional space, which computes the distance between two vectors $p$ and $q$ as the square root of the sum of the squared differences of their components:

\[ \sqrt{\sum_{k=1}^{n} (p_k - q_k)^2} \tag{1} \]

To give relevance to the degree of deviation of answers differing from the average, we assigned every pair a weight ($w_k$) equal to the number of times the 'winning' essay was chosen; then, we computed the weighted distance between annotations and average vectors:

\[ \sqrt{\sum_{k=1}^{n} w_k (p_k - q_k)^2} \tag{2} \]

Finally, we ranked the weighted and unweighted distance values of each survey in ascending order and calculated the inter-annotator agreement (IAA) of the first 15 and 20 annotations. We implemented Krippendorff's alpha ($\alpha$), a coefficient that expresses IAA in terms of observed ($D_o$) and expected, i.e. chance ($D_e$), disagreement (Krippendorff, 2011):

\[ \alpha = 1 - \frac{D_o}{D_e} \tag{3} \]

We noticed that the IAA values of the first 15 submissions ordered by increasing weighted distance were the highest. Thus, we took them into account (150 total annotations) for the analysis and discarded the remaining 73.⁴ It is noteworthy that the selection led to an average IAA of 0.26, a much higher value than the initial 0.12. Relying on the selected annotations, we established the 'winning' and the 'loser' essay of each pair.

⁴ The corpus of evaluated essays is available at http://www.italianlp.it/EvaluatedEssays.zip

4 Data Analysis

We carried out two evaluations: the first one was meant to identify which linguistic features have the greatest impact on the human assessment of writing quality; the second one focused on the impact of students' errors on annotators' judgments. In what follows we describe the approach underlying the two perspectives and discuss our most interesting findings.

4.1 Linguistic Profiling and Stylistic Analysis

The first analysis relies on linguistic profiling, an NLP-based methodology in which a large set of linguistically motivated features automatically extracted from annotated texts is used to obtain a vector-based representation of them. Such representations can then be compared across texts representative of different textual genres and varieties to identify the peculiarities of each (Montemagni, 2013; van Halteren, 2004). To perform the analysis, we relied on Profiling-UD⁵, a recently introduced tool that allows the extraction of a wide set of lexical, morpho-syntactic and syntactic features from texts linguistically annotated according to the Universal Dependencies (UD)⁶ formalism. These features, described in detail in Brunato et al. (2020), have been shown to be involved in many tasks, all related to modeling the form rather than the content of a text, such as the assessment of text readability and linguistic complexity and the identification of stylistic traits of an author or groups of authors.

⁵ http://linguistic-profiling.italianlp.it/
⁶ https://universaldependencies.org/

We thus split our annotated corpus into two sections: one comprised all 'winning' essays and the other all 'loser' ones. Using Profiling-UD, we extracted a feature-based vector representation for each text of the two subsets. For each feature considered, we calculated the average value, the standard deviation and the coefficient of variation ($SD/Avg$) in the two subsets and we assessed whether the variation between mean values was significant using the Wilcoxon rank sum test.

| Feature | 'Winning' Avg. | 'Winning' SD | 'Losers' Avg. | 'Losers' SD |
|---|---|---|---|---|
| n_tokens | 374.9 | 127.4 | 342.7 | 116.3 |
| ttr_form_chunks_100 | 0.72 | 0.06 | 0.70 | 0.06 |
| upos_dist_NOUN | 16.31 | 2.49 | 16.98 | 2.63 |
| verbs_tense_dist_Fut | 2.75 | 4.37 | 2.47 | 6.90 |
| verbs_form_dist_Ger | 3.13 | 3.52 | 2.32 | 3.25 |
| aux_mood_dist_Sub | 4.41 | 7.22 | 2.48 | 4.51 |
| n_prepositional_chains | 10.70 | 6.28 | 9.50 | 5.98 |

Table 2: Linguistic features whose average varies significantly between the two subsets.

Table 2 shows the seven linguistic features whose variation turned out to be statistically significant (p-value < 0.05), ordered by increasing p-value. It emerges that 'winning' essays are on average longer (32.2 tokens more) than the 'losers' (n_tokens), a finding that may suggest that longer compositions are evaluated as more reasoned, structured and content-rich. Interestingly, this also reflects the students' perception of school writing: Barbagli et al. (2015) showed that two of the most frequent suggestions contained in essays that respond to 'common prompts' are Leggi/scrivi molto ("Read/write a lot") and Lavora sodo, fai vedere che ti impegni ("Work hard, show your dedication"). Thus, pupils possibly write more so as to show their dedication and get higher grades. Secondly, we noticed that a richer vocabulary (ttr_form_chunks_100) plays a crucial role in native speakers' judgment. This is in line with another piece of advice from the ranking just mentioned, Usa un vocabolario ricco ed espressivo ("Use a rich and expressive vocabulary"), which reflects teachers' encouragement to use synonyms in order to write clearer and more readable compositions. The values of the third feature (upos_dist_NOUN) reveal that 'loser' essays present a slightly higher distribution of nouns. A predominant use of nouns is typical of highly informative texts (e.g. newspaper articles, laws), while genres closer to speech contain more verbs (Montemagni, 2013). Since school essays belong to the second category, an essay with fewer nouns is probably perceived as more coherent with its genre. Concerning verbal inflection, 'better' productions include, on average, 0.28 percentage points more future verbs (verbs_tense_dist_Fut), 0.81 more gerund verbs (verbs_form_dist_Ger) and 1.93 more subjunctive auxiliary verbs (aux_mood_dist_Sub). Verbal tenses other than the present and moods other than the indicative require advanced linguistic skills, which positively influence annotators' choices. The last feature varying significantly between the two groups is the number of prepositional chains (n_prepositional_chains): 'winning' compositions have, on average, 1.2 more of them.

A further study focused on the degree of variability of linguistic features in the two essay groups. For each subset, we ordered the features by increasing coefficient of variation; then, we calculated the difference between the two rankings in order to identify the features that were most uniformly distributed in 'better' essays as compared to the 'worse' ones (Table 3).

| Feature | 'Winning' Avg. | 'Winning' SD | 'Losers' Avg. | 'Losers' SD |
|---|---|---|---|---|
| verbs_tense_dist_Fut | 2.75 | 4.37 | 2.47 | 6.90 |
| dep_dist_cop | 1.85 | 0.98 | 1.93 | 1.24 |
| dep_dist_flat:foreign | 0.03 | 0.14 | 0.02 | 0.17 |
| dep_dist_flat:name | 0.31 | 0.52 | 0.32 | 0.79 |
| dep_dist_det:predet | 0.27 | 0.26 | 0.24 | 0.30 |
| dep_dist_parataxis | 0.13 | 0.21 | 0.15 | 0.31 |
| obj_pre | 31.35 | 13.02 | 30.02 | 15.87 |
| verb_edges_dist_0 | 1.23 | 1.62 | 1.06 | 1.74 |
| verb_edges_dist_1 | 13.45 | 5.44 | 12.48 | 6.30 |
| upos_dist_CCONJ | 4.17 | 1.28 | 4.51 | 1.61 |

Table 3: The 10 features that, maximally varying in 'loser' essays, are more uniform in the 'winning' ones.

It can be noticed that future verbs (verbs_tense_dist_Fut) are very uniformly distributed among 'better' essays. We have previously commented that their frequency is higher in the 'winners'; this again suggests that native speakers interpret the use of complex verbal forms as an indicator of higher skills. The distribution of parataxis (dep_dist_parataxis) is also quite uniform in 'winning' essays; however, its average value is higher in the 'loser' ones. It can be deduced that annotators prefer hypotaxis, which is not surprising: hypotactic periods are more structured and elegant and require refined abilities to be built. The same evidence is found at the morphosyntactic level (upos_dist_CCONJ), since 'better' compositions include 0.34 percentage points fewer coordinating conjunctions. Curiously, 'better' essays have, on average, 0.01 percentage points more foreign terms (dep_dist_flat:foreign, 0.03 against 0.02 in Table 3); this may suggest that annotators appreciate these expressions. Finally, it is worth highlighting a higher and more uniform percentage of verbs with few modifiers in the 'winning' essays (verb_edges_dist_0, verb_edges_dist_1).

4.2 Students' Errors Impact

The last analysis was aimed at assessing whether and to what extent students' errors impact human judgments. We counted the pairs of essays whose 'winning' composition had a lower number of errors, those in which the 'loser' one had fewer mistakes and those with an equal number of errors. We noticed that essays with fewer errors won in 56% of cases, reaching 79% when pairs with the same number of errors are included. This procedure gave a first empirical answer to our starting question: errors substantially affect human assessment.

At this point, we focused on error categories to identify which ones most affect the perception of writing quality. For each category, we calculated the average number of errors and their standard deviation in both subsets; then, relying on the Wilcoxon rank sum test, we found that grammatical and orthographic mistakes vary significantly between the two groups (Table 4). As expected, 'loser' essays have, on average, 1.29 more grammatical errors and 0.85 more orthographic errors. It is worth adding that the variation in orthographic mistakes (p-value = 0.007) is more significant than that in grammatical ones (p-value = 0.029). This could mean that native speakers judge deviations in orthography more harshly than those in grammar. Once again, our findings are in line with Barbagli et al. (2015): Usa una corretta ortografia ("Use correct orthography") is the 2nd most frequent suggestion given in the second year; moreover, Errori di ortografia ("Orthography errors") occupies the 6th and the 1st position among the most salient terms of the first and the second year, respectively. The non-significant variations of lexical (p-value = 0.581) and punctuation errors (p-value = 0.617) are probably due to their scarcity in the analysed essays.

| Category | 'Winning' essays Avg. | 'Winning' essays SD | 'Loser' essays Avg. | 'Loser' essays SD |
|---|---|---|---|---|
| Grammar | 3.28 | 5.516 | 4.57 | 6.126 |
| Orthography | 3.18 | 4.517 | 4.03 | 4.826 |

Table 4: Error categories whose average varies significantly between the two subsets.

5 Conclusions

We presented a pilot study towards the identification of the linguistic features that characterize essays perceived as well written. We collected Italian native speakers' preferences on 100 pairs of essays written by L1 students, which we analysed in terms of linguistic profiling and error distribution. Our results reveal an interesting correspondence between annotators' judging criteria and the writing instructions that L1 learners receive from teachers. Our findings could be interpreted as an indicator of the reliability of our data and, more generally, could suggest the effectiveness of crowdsourcing methods for quickly building large and reliable datasets. Considering the lack of Italian corpora of graded essays, such datasets could be valuable resources for the development of Computer-Assisted Learning Systems.

The limited size of our dataset certainly reduced the number of results. Thus, we have to expand it (i) by collecting more annotations for the already existing surveys and (ii) by creating and distributing new surveys in order to gather judgments on new pairs of essays. Analysis of the enlarged dataset could reveal more features that characterize good essays. Following the model of Miaschi et al. (2021), we could use the results to train a classifier that, given a pair of essays, recognizes the better written one. The tool would not presume to replace teachers, but it could be a valuable teaching aid. Students could use it to get an immediate and preliminary self-assessment of their written productions so as to better understand their mistakes and hopefully avoid repeating them. Such tools can be very useful if integrated into educational processes based on distance learning paradigms, which need adequate technological infrastructures to be really efficient.

References

Yigal Attali and Jill Burstein. 2006. Automated Essay Scoring With e-rater® V.2. The Journal of Technology, Learning, and Assessment, 4(3).

Alessia Barbagli, Pietro Lucisano, Felice Dell'Orletta, and Giulia Venturi. 2015. Il ruolo delle tecnologie del linguaggio nel monitoraggio dell'evoluzione delle abilità di scrittura: primi risultati. Italian Journal of Computational Linguistics (IJCoL), 1(1):99–117.

Alessia Barbagli, Pietro Lucisano, Felice Dell'Orletta, Simonetta Montemagni, and Giulia Venturi. 2016. CItA: an L1 Italian Learners Corpus to Study the Development of Writing Competence. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), pages 88–95, Portorož, Slovenia. European Language Resources Association (ELRA).

Dominique Brunato, Andrea Cimino, Felice Dell'Orletta, Simonetta Montemagni, and Giulia Venturi. 2020. Profiling-UD: a Tool for Linguistic Profiling of Texts. In Proceedings of the 12th Conference of Language Resources and Evaluation (LREC 2020), pages 7145–7151, Marseille, France. European Language Resources Association (ELRA).

Sylviane Granger. 2003. Error-tagged learner corpora and CALL: A promising synergy. CALICO Journal, 20(3):465–480.

Hans van Halteren. 2004. Linguistic profiling for author recognition and verification. In Proceedings of the Association for Computational Linguistics, pages 200–207.

Klaus Krippendorff. 2011. Computing Krippendorff's Alpha-Reliability. Technical report, University of Pennsylvania.

Alessio Miaschi, Dominique Brunato, and Felice Dell'Orletta. 2021. A NLP-based stylometric approach for tracking the evolution of L1 written language competence. Journal of Writing Research (JoWR), 13(1):71–105.

Simonetta Montemagni. 2013. Tecnologie linguistico-computazionali e monitoraggio della lingua italiana. Studi Italiani di Linguistica Teorica e Applicata (SILTA), pages 145–172.

Hwee Tou Ng, Siew Mei Wu, Yuanbin Wu, Christian Hadiwinoto, and Joel Tetreault. 2013. The CoNLL-2013 Shared Task on Grammatical Error Correction. In Proceedings of the Seventeenth Conference on Computational Natural Language Learning: Shared Task, pages 1–12, Sofia, Bulgaria. Association for Computational Linguistics.
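The error binning used for surveys 9 and 10 (Section 2.1: the range from 0 to the yearly maximum, 49 or 43 errors, split into ten bins) can be illustrated with a minimal sketch. The paper does not say how the bin edges were drawn, so equal-width bins are an assumption here, and the names `error_bin` and `group_by_bin` are hypothetical.

```python
from collections import defaultdict


def error_bin(n_errors, max_errors, n_bins=10):
    """Return the 0-based index of the bin that holds n_errors.

    Assumption: bins are equal-width over the integer range 0..max_errors.
    """
    if not 0 <= n_errors <= max_errors:
        raise ValueError("n_errors must lie in [0, max_errors]")
    # max_errors + 1 integer values are spread over n_bins bins.
    width = (max_errors + 1) / n_bins
    # min() keeps the maximum value inside the last bin.
    return min(int(n_errors / width), n_bins - 1)


def group_by_bin(error_counts, max_errors, n_bins=10):
    """Map each bin index to the ids of the essays falling into it."""
    bins = defaultdict(list)
    for essay_id, n_errors in error_counts.items():
        bins[error_bin(n_errors, max_errors, n_bins)].append(essay_id)
    return dict(bins)
```

A pair of essays for each survey page would then be drawn from each non-empty bin; how the authors picked the two essays inside a bin is not stated in the paper.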
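To make the agreement coefficient of Section 3.1 concrete, the following is a minimal sketch of Krippendorff's alpha, computed as one minus the ratio of observed to expected disagreement. It assumes nominal labels (here '1' or '2') and no missing answers; the paper does not state which variant or implementation was used, and the function name `krippendorff_alpha_nominal` is hypothetical.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal labels.

    units: one list per rated item (e.g. per pair of essays), holding the
    labels that the annotators assigned to that item.
    """
    # Coincidence matrix: every ordered pair of labels within a unit
    # contributes 1 / (m - 1), where m is the number of labels in the unit.
    coincidences = Counter()
    for labels in units:
        m = len(labels)
        if m < 2:
            continue  # a single label cannot be compared with anything
        for a, b in permutations(labels, 2):
            coincidences[(a, b)] += 1 / (m - 1)

    # Marginal totals per label and overall number of pairable values.
    totals = Counter()
    for (a, _), value in coincidences.items():
        totals[a] += value
    n = sum(totals.values())

    observed = sum(v for (a, b), v in coincidences.items() if a != b) / n
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    ) / (n * (n - 1))
    if expected == 0:
        raise ValueError("alpha is undefined when only one label is used")
    return 1 - observed / expected
```

For instance, two annotators who disagree on 4 of 10 binary items, with label totals of 14 and 6, obtain an alpha of about 0.095, in the same low range as the 0.12 and 0.26 values reported in the paper.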
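The selection function of Section 3.1 (majority vector per survey, weights equal to the number of votes for the 'winning' essay, weighted Euclidean distance, then the 15 closest annotations) can be sketched as follows. This is an illustration under assumptions, not the authors' code: the function names are hypothetical, labels are encoded as the integers 1 and 2, and ties in the majority vote are resolved in favour of the label seen first, a detail the paper does not specify.

```python
import math
from collections import Counter


def majority_vector(annotations):
    """Most frequently assigned label for each pair of a survey."""
    n_pairs = len(annotations[0])
    return [
        Counter(a[k] for a in annotations).most_common(1)[0][0]
        for k in range(n_pairs)
    ]


def pair_weights(annotations, majority):
    """w_k: number of annotators who chose the 'winning' essay of pair k."""
    return [
        sum(1 for a in annotations if a[k] == majority[k])
        for k in range(len(majority))
    ]


def weighted_distance(p, q, w=None):
    """Weighted Euclidean distance; with w=None it is the plain metric."""
    if w is None:
        w = [1] * len(p)
    return math.sqrt(sum(wk * (pk - qk) ** 2 for wk, pk, qk in zip(w, p, q)))


def select_closest(annotations, keep=15):
    """Keep the annotations closest to the survey's majority vector."""
    majority = majority_vector(annotations)
    weights = pair_weights(annotations, majority)
    ranked = sorted(
        annotations, key=lambda a: weighted_distance(a, majority, weights)
    )
    return ranked[:keep]
```

Because the weight grows with the size of the majority, an annotation that contradicts a near-unanimous choice is pushed further from the average vector than one that contradicts a contested choice.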
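The per-feature statistics of Section 4.1 (average, standard deviation, coefficient of variation, and a Wilcoxon rank sum test between 'winning' and 'loser' essays) can be sketched with the standard library alone. Two assumptions are made that the paper does not confirm: the standard deviation is the sample one, and the test uses the normal approximation without tie or continuity correction. The function names are hypothetical.

```python
import math
from statistics import mean, stdev


def coefficient_of_variation(values):
    """SD / Avg, using the sample standard deviation (an assumption)."""
    return stdev(values) / mean(values)


def rank_sum_test(x, y):
    """Two-sided Wilcoxon rank sum test via the normal approximation.

    Returns (z, p). Tied values receive the average of their ranks.
    """
    combined = sorted(
        (value, group) for group, sample in enumerate((x, y)) for value in sample
    )
    rank_sum_x = 0.0
    i = 0
    while i < len(combined):
        j = i
        while j < len(combined) and combined[j][0] == combined[i][0]:
            j += 1
        # Positions i..j-1 hold equal values with ranks i+1..j.
        average_rank = (i + 1 + j) / 2
        rank_sum_x += average_rank * sum(
            1 for k in range(i, j) if combined[k][1] == 0
        )
        i = j
    n1, n2 = len(x), len(y)
    expected = n1 * (n1 + n2 + 1) / 2
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (rank_sum_x - expected) / sd
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided p-value
    return z, p
```

Calling `rank_sum_test(winning_values, loser_values)` for one feature at a time and keeping the features with `p < 0.05` mirrors the filter that produced Table 2, under the assumptions above.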
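The ranking comparison behind Table 3 (features ordered by increasing coefficient of variation in each subset, then compared by rank) can be sketched as below. The paper does not give the exact formula for the "difference between the two rankings"; this sketch assumes it is the rank among 'loser' essays minus the rank among 'winning' essays, and the function names and toy values are hypothetical.

```python
def rank_by_cv(cv_by_feature):
    """Rank features by increasing coefficient of variation (1 = most uniform)."""
    ordered = sorted(cv_by_feature, key=cv_by_feature.get)
    return {feature: rank for rank, feature in enumerate(ordered, start=1)}


def uniformity_gain(cv_winning, cv_losers):
    """Order features by how much more uniform they are in 'winning' essays."""
    rank_win = rank_by_cv(cv_winning)
    rank_lose = rank_by_cv(cv_losers)
    # A large positive gap means: highly variable among the 'losers',
    # comparatively uniform among the 'winners'.
    return sorted(
        rank_win, key=lambda f: rank_lose[f] - rank_win[f], reverse=True
    )


# Toy values, invented for illustration only.
toy_winning = {"feat_a": 0.2, "feat_b": 0.5, "feat_c": 0.9}
toy_losers = {"feat_a": 0.8, "feat_b": 0.3, "feat_c": 0.6}
top_feature = uniformity_gain(toy_winning, toy_losers)[0]
```

Taking the first ten entries of the returned list would correspond to the selection shown in Table 3, if the assumed rank difference matches the authors' definition.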
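The pair counting of Section 4.2 can be sketched in a few lines. From the reported figures, the 'winning' essay has fewer errors in 56% of pairs and the share rises to 79% with ties, which implies about 23% of pairs with equal error counts and about 21% in which the 'winner' has more errors. The function name `error_impact` and the toy pairs are hypothetical.

```python
def error_impact(pairs):
    """Shares of pairs by error comparison.

    pairs: list of (winner_errors, loser_errors) tuples, one per pair.
    """
    n = len(pairs)
    fewer = sum(1 for w, l in pairs if w < l)
    equal = sum(1 for w, l in pairs if w == l)
    more = n - fewer - equal
    return {
        "winner_fewer": fewer / n,
        "equal": equal / n,
        "winner_more": more / n,
    }


# Toy pairs, invented for illustration only.
toy_shares = error_impact([(2, 5), (3, 3), (7, 1), (0, 4)])
```

Running the same function on the per-category counts (grammar, orthography, lexicon, punctuation) instead of the total would give a first descriptive view before the significance test reported in Table 4.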