Automatic Assessment of English CEFR Levels Using BERT Embeddings

Veronica Juliana Schmalz (1,3), Alessio Brutti (1,2)
1. Free University of Bozen-Bolzano, Bolzano, Italy
2. Fondazione Bruno Kessler, Trento, Italy
3. KU Leuven, imec research group itec, Kortrijk, Belgium
veronicajuliana.schmalz@kuleuven.be, brutti@fbk.it

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract

The automatic assessment of language learners' competences represents an increasingly promising task thanks to recent developments in NLP and deep learning technologies. In this paper, we propose the use of neural models for classifying English written exams into one of the Common European Framework of Reference for Languages (CEFR) competence levels. We employ pre-trained Bidirectional Encoder Representations from Transformers (BERT) models, which provide efficient and rapid language processing on account of attention-based mechanisms and the capacity to capture long-range sequence features. In particular, we investigate augmenting the original learner's text with corrections provided by an automatic tool or by human evaluators. We consider different architectures where the texts and corrections are combined at an early stage, via concatenation before the BERT network, or as a late fusion of the BERT embeddings. The proposed approach is evaluated on two open-source datasets: the English First Cambridge Open Language Database (EFCAMDAT) and the Cambridge Learner Corpus for the First Certificate in English (CLC-FCE). The experimental results show that the proposed approach can predict the learner's competence level with remarkably high accuracy, in particular when large labelled corpora are available. In addition, we observed that augmenting the input text with corrections provides a further improvement in the automatic language assessment task.

1 Introduction

Finding a system which objectively evaluates language learners' competences is a daunting task. Several aspects need to be considered, including both subjective factors, like age, native language, cognitive capacities of the learner, and learning-related factors, for example the amount and type of received linguistic input (James, 2005; Chapelle and Voss, 2008; Jang, 2017). Indeed, language competences are not holistic, but concern different domains, so that considering the mere formal correctness of learners' language has been shown not to represent a proper assessment procedure (Roever and McNamara, 2006; Harding and McNamara, 2017; Chapelle, 2017). Moreover, human evaluators, despite having to adhere to a pre-defined scale and guidelines, such as the CEFR (Council of Europe, 2001), have proved to be biased (Karami, 2013) and inaccurate (Figueras, 2012). For these reasons, new language testing methods and tools have been developed. Current state-of-the-art models, such as Transformers, allow numerous and complex linguistic data to be processed efficiently and rapidly, by means of attention-based mechanisms and deep neural networks that capture the relevant features for the targeted task. However, the creation of, and access to, the necessary language examination resources, including annotations and metadata, remain limited to date. In this paper, we propose using a series of BERT-base models to automatically assign CEFR levels to language learners' exams.

Our aim is to examine the possibility of providing the system with previously generated corrections, produced either by humans or automatically with a language checker. Additionally, we want to analyse the impact of the amount of data on the accuracy of the model in the classification of written exams taken from the English First Cambridge Open Language Database (EFCAMDAT) (Geertzen et al., 2013) and the Cambridge Learner Corpus for the First Certificate in English (CLC-FCE) (Yannakoudakis et al., 2011). In this way, a significant step forward could be made both in improving the functioning of these automatic systems and in the future collection of data for other languages.
2 Related Works

Automatic language assessment methods concern the creation of fast, effective, unbiased and cross-linguistically valid systems that can both simplify assessment and render it objective. However, achieving such results is a complex task that researchers have been addressing for years, experimenting with several methodologies and techniques. The first tools developed mainly dealt with written texts and exploited Part-of-Speech (PoS) tagging to grade students' essays (Burstein et al., 2013), and latent semantic analysis to evaluate the content, also providing short feedback (Landauer, 2003). Advances in AI, NLP and Automatic Speech Recognition (ASR) led to the additional emergence of systems that assess spoken language skills, such as SpeechRater (Xi et al., 2008), which considers clarity of expression, pronunciation and fluency. To date, several other automatic language assessment tools are applied in the domain of large-scale testing, for example Criterion (Attali, 2004), Project Essay Grade (Wilson and Roscoe, 2020), MyAccess! (Chen and Cheng, 2008) and Pigai (Zhu, 2019). The first can detect grammatical and usage-based errors, as well as punctuation mistakes, also providing feedback; however, it requires training on the specific topics to be assessed. The second system exploits a training set of human-scored essays to score unseen texts, evaluating diction, grammar and complexity with statistical and linguistic models. Similarly, MyAccess!, calibrated with a large number of essays, can score learners' texts and measure advanced features such as syntactic and lexical complexity, content development and word choice, providing detailed feedback. Pigai, on the contrary, exploits NLP to compare the essays submitted by students with those contained in its corpora, measuring the distance between the two (Zhu, 2019). Despite the efficiency of these tools, to perform accurately they generally need large amounts of labelled and human-corrected training data. Furthermore, a standard scale is needed which can be extended across different groups of learners. In addition, powerful computational resources and, in certain cases, significant memory are required. All these elements together constitute fundamental pre-requisites which can hardly be fulfilled. For this reason, we present an approach, distinct from the previous ones, which, starting from different amounts of students' original texts, provides a classification into the different CEFR levels exploiting BERT-base models and subsidiary corrections.

3 Proposed Approach

The approach we propose for the automatic assessment of the language competences of adult English language learners is based on Transformer-type architectures performing multi-class classification. Among these, BERT-based models, characterised by efficient parallel training and the capacity to capture long-range sequence features, stand out for their size and amount of training data (Vaswani et al., 2017). Being pre-trained on large generic corpora with Masked Language Modelling (MLM) and Next Sentence Prediction (NSP) strategies, they can be conveniently employed in a wide range of tasks, including text classification, language understanding and machine translation.

The models we use for our experiments are grounded on the BERT-base-uncased architecture, part of the Hugging Face Transformers library released in 2019 (Wolf et al., 2020) and inspired by BERT (Devlin et al., 2018) from Google Research, which encodes input texts into low-dimensional embeddings. Our baseline model maps these compact representations into the CEFR levels using a network with two fully connected layers. Fig. 1(a) graphically represents the architecture. Note that this approach requires training the final classifier only: retraining or fine-tuning the BERT model would probably require very large datasets, which are not always available for this task. In order to augment the input text with corrections (either automatic or human) we investigate two possible directions. The first one (Fig. 1(b)) concatenates the two texts and applies the pre-trained BERT model; the resulting embeddings are expected to encode the information related to both texts. Conversely, the second architecture extracts individual embeddings for the original texts and the corrected ones; these are then merged and processed by the classifier, as shown in Fig. 1(c).

Figure 1: Proposed architectures for CEFR prediction. (a) Baseline: original learners' texts as input; (b) Concatenation: model taking the original learners' texts and the corrections concatenated; (c) Two-streams: model processing the original learners' texts and the corrections with separate streams.

We resort to these types of models in order to process texts efficiently while capturing long-range sequence features, thanks to parallel word processing and self-attention mechanisms. Regardless of the length of the texts, the architecture should indeed be able to accurately categorise the examinations according to the CEFR A1, A2, B1, B2 and C1 levels of competence. These, in fact, are fed to the model as labels during training, together with single contextual embeddings, or concatenated ones if corrections are included. Note that we do not provide the model with any indication about the types of errors in the original text; this information is directly extracted by the model when processing the original text together with its corrected version. A minimal implementation sketch of the three variants is given below.
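To make the three variants concrete, the following sketch shows how they can be assembled around a frozen BERT-base-uncased encoder. It is a minimal illustration assuming TensorFlow 2 (tf.keras) and the TF models of the Hugging Face transformers library; helper names such as `pooled_embedding` and `classifier_head` are ours, not taken from the paper's code, and the layer sizes, dropout rate and maximum length anticipate the values reported in Section 4.4.

```python
# Minimal sketch of the three architectures in Fig. 1 (assumptions: TensorFlow 2
# with tf.keras and the TF variants of the Hugging Face models; names illustrative).
import tensorflow as tf
from transformers import BertTokenizerFast, TFBertModel

MAX_LEN = 450      # maximum input length in tokens (512 for CLC-FCE, see Sec. 4.4)
NUM_LEVELS = 5     # CEFR levels A1, A2, B1, B2, C1

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
bert = TFBertModel.from_pretrained("bert-base-uncased")
bert.trainable = False                      # the pre-trained encoder stays frozen

def encode(texts):
    """Tokenize a list of strings into fixed-length id/mask arrays."""
    enc = tokenizer(texts, padding="max_length", truncation=True,
                    max_length=MAX_LEN, return_tensors="np")
    return enc["input_ids"], enc["attention_mask"]

def pooled_embedding(ids, mask):
    """768-dim text representation: global average pooling over BERT outputs."""
    hidden = bert(ids, attention_mask=mask).last_hidden_state
    return tf.keras.layers.GlobalAveragePooling1D()(hidden)

def classifier_head(x):
    """Two fully connected layers followed by the CEFR softmax (see Sec. 4.4)."""
    x = tf.keras.layers.Dense(768, activation="relu")(x)
    x = tf.keras.layers.Dropout(0.2)(x)
    x = tf.keras.layers.Dense(128, activation="relu")(x)
    x = tf.keras.layers.Dropout(0.2)(x)
    return tf.keras.layers.Dense(NUM_LEVELS, activation="softmax")(x)

# (a) Baseline -- and (b) concatenation, for which the learner text and its
# correction are simply joined into one string before calling `encode`.
ids_in = tf.keras.Input(shape=(MAX_LEN,), dtype=tf.int32)
mask_in = tf.keras.Input(shape=(MAX_LEN,), dtype=tf.int32)
single_stream = tf.keras.Model(
    [ids_in, mask_in], classifier_head(pooled_embedding(ids_in, mask_in)))

# (c) Two streams: separate embeddings for the original and the corrected text,
# merged (here by averaging) before the same kind of classifier.
corr_ids_in = tf.keras.Input(shape=(MAX_LEN,), dtype=tf.int32)
corr_mask_in = tf.keras.Input(shape=(MAX_LEN,), dtype=tf.int32)
merged = tf.keras.layers.Average()([pooled_embedding(ids_in, mask_in),
                                    pooled_embedding(corr_ids_in, corr_mask_in)])
two_streams = tf.keras.Model(
    [ids_in, mask_in, corr_ids_in, corr_mask_in], classifier_head(merged))
```

In the concatenation variant the original text and its correction share the same token budget, whereas the two-stream variant encodes them separately and fuses the pooled embeddings; averaging is one possible merge operation, reflecting the late fusion described above.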
4 Experimental Analysis

We evaluate the architectures described above, using both automatic and human corrections, on two English open-source datasets: EFCAMDAT and CLC-FCE. We also experiment with varying amounts of training material. The performance of the models is measured in terms of weighted classification accuracy.

4.1 EFCAMDAT Dataset

The EFCAMDAT dataset constitutes one of the largest language learner datasets currently available (Geertzen et al., 2013). The version we use contains 1,180,310 essays submitted by adult English learners of more than 172 different nationalities, covering 16 distinct levels compliant with the CEFR proficiency ones. Each essay has been corrected and evaluated by language instructors; in addition to the original texts, their corrected versions and annotated errors are also included.

We considered a sub-set of the dataset comprising 100,000 tests. Table 1 reports the distribution of the exams across the different CEFR levels, including also the average numbers of violations identified by both human evaluators and the automatic tool, normalized by the average text length. Note that the average number of errors per word decreases as the level of competence increases. Observe also that the automatic errors tend to be more numerous than the human ones, in particular for low competence levels. We use the official test partition, composed of 1,447 essays. The development set is a 20% subset of the training set.

Table 1: EFCAMDAT dataset (sample of 100,000 exams): number of exams per CEFR level, mean text length (in tokens), and mean number of manually and automatically annotated errors per word.

levels   n. exams   avg. length   manual errors/word   automatic errors/word
A1       37,290     40            4·10^-2              10·10^-2
A2       36,618     67            4·10^-2              6·10^-2
B1       18,119     92            4·10^-2              5·10^-2
B2       6,042      129           3·10^-2              4·10^-2
C1       1,732      170           2·10^-2              3·10^-2

4.2 CLC-FCE Dataset

The CLC-FCE dataset is a collection of texts produced by adult learners of English as a Second or Other Language (ESOL) for the First Certificate in English (FCE) written exam, which attests a B2 CEFR level (Yannakoudakis et al., 2011). The learners' productions, consisting of two texts, have been evaluated with a score between 0 and 5.3, and the errors have been classified into 77 classes. Following the guidelines of the authors, the average score of the two texts has been mapped to CEFR levels, as shown in Table 2. Note that only 4 levels are available in this dataset and that the labels do not uniformly match the ones present in EFCAMDAT. Table 2 also reports the distribution of the texts across the 4 classes, with the error partitions. We notice that, in this case, manual errors have been annotated in more detail and are indeed more numerous than the automatic ones. In general, the number of errors is higher than what is observed in EFCAMDAT. Also for this corpus, the average number of errors per word, both automatic and manual, decreases as the level increases. The total number of texts within the corpus is 2,469. We employed a data partition according to which 2,017 examinations constitute the training set, whereas the remaining 194 constitute the test set. In addition, 10% of the training material is used as the validation set. From the entire corpus we had to exclude 10 texts, since they were not provided with an assigned score. Despite its small size, CLC-FCE represents an important resource, given its systematic analysis of errors and the human corrections provided.

Table 2: CLC-FCE dataset: assigned scores and number of exams per CEFR level, mean text length (in tokens), and mean number of manually and automatically annotated errors per word.

scores      levels   n. exams   avg. length   manual errors/word   automatic errors/word
0.0 - 1.1   A2       10         220           16·10^-2             7·10^-2
1.2 - 2.3   B1       417        205           14·10^-2             7·10^-2
3.1 - 4.3   B2       1,414      212           9·10^-2              6·10^-2
5.1 - 5.3   C1       265        234           6·10^-2              4·10^-2
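For illustration, the score-to-level bands of Table 2 can be expressed as a small lookup helper. This is a sketch based only on the table: it assumes the two per-exam scores are averaged before mapping, and scores falling between the listed bands (e.g. 2.4-3.0) are simply left unmapped.

```python
# Sketch of the score-to-CEFR mapping of Table 2 (assumption: the two essay
# scores are averaged first; scores outside the listed bands return None).
def clc_fce_level(avg_score: float):
    bands = [
        (0.0, 1.1, "A2"),
        (1.2, 2.3, "B1"),
        (3.1, 4.3, "B2"),
        (5.1, 5.3, "C1"),
    ]
    for low, high, level in bands:
        if low <= avg_score <= high:
            return level
    return None  # score ranges between bands are not listed in Table 2

assert clc_fce_level((4.0 + 4.3) / 2) == "B2"
```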
4.3 LanguageTool

In both datasets, the content written by language learners varies according to the level of competence they were supposed to demonstrate. In addition to the human corrections provided with the data, we have generated automatic corrections using LanguageTool (Miłkowski, 2010), a language checker capable of detecting grammatical, syntactic, orthographic and stylistic errors in order to automatically correct texts of different nature and length (Naber et al., 2003). The automatic checker is based on surface text processing, does not use a deep parser and does not require a fully formalised grammar. By means of this tool, we applied the pre-defined rules for the English language to the learners' essays, generating new corrected texts for EFCAMDAT and CLC-FCE. These were used as additional input data for the experiments.
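A minimal sketch of this correction step is shown below. It relies on the `language_tool_python` wrapper, which is our assumption: the paper does not state which interface to LanguageTool was used.

```python
# Sketch of generating automatic corrections with LanguageTool; the
# `language_tool_python` wrapper is an assumed interface, chosen for illustration.
import language_tool_python

tool = language_tool_python.LanguageTool("en-US")   # pre-defined English rules

def auto_correct(text: str) -> str:
    """Apply all suggested replacements and return the corrected text."""
    return tool.correct(text)

print(auto_correct("He go to school every days."))
# e.g. "He goes to school every day." (output depends on the active rule set)
```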
4.4 Implementation Details

Our models have been implemented using Keras and Hugging Face's pre-trained BERT-base-uncased architecture (Wolf et al., 2020). The models' encoder module, consisting of a Multi-Head Attention and a Feed-Forward component, receives as input the original learners' exams, possibly together with additional human or automatic corrections. The transformed contextual embeddings are obtained by applying Global Average Pooling to the outputs of the pre-trained, frozen BERT encoder. The classifier consists of a Dense layer of 768 units with ReLU activation and a Dropout rate of 0.2, followed by another Dense layer with fewer units, 128, and the same activation function and Dropout rate (1).

Lastly, the output layer consists of a Dense layer with Softmax as activation function, and the final outputs correspond to the different CEFR levels into which the texts are classified. The selected loss is the Sparse Categorical Cross-entropy and the evaluation metric is the accuracy. The model is trained using Adam as optimizer, with learning rate 10^-5 for EFCAMDAT and 10^-4 for CLC-FCE. The batch size is 32 and the maximum input text length is set to 450 tokens for EFCAMDAT and 512 for CLC-FCE. These hyper-parameters were optimized on the related development sets.

(1) https://www.kaggle.com/akensert/bert-base-tf2-0-now-huggingface-transformer
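Put together, the training configuration above corresponds to something like the following sketch. It reuses the `single_stream` model and the `encode` helper from the sketch in Section 3 (illustrative names), and the number of epochs, which the paper does not report, is a placeholder.

```python
# Training-configuration sketch matching the hyper-parameters above; it reuses
# `single_stream` and `encode` from the earlier sketch (assumed names).
import numpy as np
import tensorflow as tf

model = single_stream
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-5),   # 1e-4 for CLC-FCE
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)

# Tiny illustrative inputs; the actual experiments use the EFCAMDAT/CLC-FCE splits.
train_texts = ["Dear Sir, I writing to complain about the service I received ..."]
train_levels = np.array([2])                 # integer CEFR index (A1=0, ..., C1=4)
ids, mask = encode(train_texts)

model.fit([ids, mask], train_levels,
          batch_size=32,
          epochs=3)                          # epoch count not reported in the paper
```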
5 Experimental Results

Table 3 reports the classification accuracy on the EFCAMDAT test set using the proposed architectures of Fig. 1. Note that, although EFCAMDAT features more than 1 million samples, we limit our analysis to 100K texts, due to memory issues and performance saturation. The results also include variations in the amount of training material, considering 10K and 50K training exams. These subsets have been obtained by sampling the training set uniformly, so that the distribution of exams per class does not change; a sketch of this subsampling is given below.
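A possible implementation of this class-preserving subsampling is sketched here; it assumes the training material is held in a pandas DataFrame with a "level" column, which is purely illustrative.

```python
# Sketch of drawing the 10K/50K subsets while keeping the per-level distribution
# unchanged (assumes a pandas DataFrame with a "level" column; names illustrative).
import pandas as pd

def uniform_subset(train_df: pd.DataFrame, n_total: int, seed: int = 0) -> pd.DataFrame:
    """Sample the same fraction from every CEFR level (stratified subsampling)."""
    frac = n_total / len(train_df)
    return (train_df.groupby("level", group_keys=False)
                    .apply(lambda g: g.sample(frac=frac, random_state=seed)))

# e.g. train_10k = uniform_subset(train_df, 10_000)
#      train_50k = uniform_subset(train_df, 50_000)
```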
Table 3: Classification accuracy on EFCAMDAT using different amounts of training data, different inputs and different architectures.

                       concatenation           two-streams
N. exams   text only   manual    automatic     manual    automatic
10K        95.2%       95.0%     95.4%         94.3%     94.4%
50K        97.1%       97.1%     97.0%         97.1%     97.0%
100K       97.4%       97.7%     97.3%         97.4%     97.2%

First of all, it is worth noting that the best approach reaches an extremely high classification accuracy (almost 98%). In addition, performance almost saturates with 50K essays, while with only 10K training samples the accuracy is well above 95%. The use of corrections, concatenated with the original text, provides some improvement over the model with original texts only. Automatic corrections seem to be more effective with less training data, while manual annotations outperform the baseline with larger training sets. The latter can, indeed, be more accurate, in particular for high proficiency levels, but their inherent variability makes the learning task more difficult. As a consequence, more training samples are needed to properly learn how to classify the input text. This is evident in Table 3, where the manual corrections are the worst for 10K samples, aligned with the baseline with 50K training samples, and the best performing when the 100K training texts are used. Finally, the two-stream approach, which averages the BERT embeddings of the two texts, seems to perform slightly worse, although by a small margin. Probably, the averaging operation is not the most suitable one in this context, as it tends to generate embedding representations which are somehow intermediate between those of the original texts and those of the corrections and, hence, less discriminative.

Table 4 reports the results obtained on the CLC-FCE corpus. With respect to EFCAMDAT, this corpus is characterized by a smaller amount of training material and by a less consistent evaluation of the input texts. These two facts lead to a clear reduction of the classification accuracy, as reported in the table. Due to the lower accuracy and the smaller size of the training set, the final performance of each model has a certain degree of variability, which depends on the model initialization and on the other random number generations in the training process. Therefore, we performed several runs varying the seed of the random number generator. The average accuracy, as well as the standard deviation, are reported in Table 4.

Table 4: Classification accuracy on CLC-FCE using different architectures and types of corrections. The two-streams model uses automatic corrections. Results are averaged over multiple runs.

model          accuracy
text only      61.5% ± 2.0
manual corr.   60.7% ± 1.8
autom. corr.   61.7% ± 1.8
two-streams    61.5% ± 1.3

Given the limited size of the training set, it is not surprising to find rather similar results across all the models. As expected, the manual corrections are the worst performing, since they would require large training sets to learn how to handle human evaluations. It is worth pointing out that the number of errors per word in CLC-FCE is much larger than in EFCAMDAT, which makes the learning task even more complex. Nevertheless, considering also the standard deviations, the models based on automatic corrections are slightly better than the model using the original texts only. The two-streams model appears extremely close to the concatenation model, but this could be related to the fact that the overall accuracy is not that high.

6 Conclusions

In this paper we presented an alternative approach for the efficient and unbiased assessment of the competences of English language learners using pre-trained BERT-base models. We structured a multi-class classification task to map the BERT embeddings of written exams from the EFCAMDAT and CLC-FCE open-source corpora to five different levels of the CEFR scale. Alongside the students' original texts and the provided manual corrections, we automatically generated additional corrected versions with LanguageTool, a multi-faceted and versatile language checker. We then conducted several experiments varying both the type and quantity of the models' input, as well as the types of models. Our results show that BERT-based architectures succeed remarkably well in classifying CEFR proficiency levels starting from the original texts, especially when large amounts of data are available. Moreover, we observed that adding automatic and manual corrections can contribute to improving the quality of the results.

References

Yigal Attali. 2004. Exploring the feedback and revision features of Criterion. Journal of Second Language Writing, 14:191–205.

Jill Burstein, Joel Tetreault, and Nitin Madnani. 2013. The e-rater® automated essay scoring system. In Handbook of Automated Essay Evaluation, pages 77–89. Routledge.

Carol A. Chapelle and Erik Voss. 2008. Utilizing technology in language assessment. Encyclopedia of Language and Education, 7:123–134.

Carol A. Chapelle. 2017. Evaluation of technology and language learning. The Handbook of Technology and Second Language Teaching and Learning, pages 378–392.

Chi-Fen Emily Chen and Wei-Yuan Eugene Cheng. 2008. Beyond the design of automated writing evaluation: Pedagogical practices and perceived learning effectiveness in EFL writing classes. Language Learning & Technology, 12(2):94–112.

Council of Europe, Council for Cultural Co-operation, Education Committee. 2001. Common European Framework of Reference for Languages: Learning, Teaching, Assessment. Cambridge University Press.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Neus Figueras. 2012. The impact of the CEFR. ELT Journal, 66(4):477–485.

Jeroen Geertzen, Theodora Alexopoulou, Anna Korhonen, et al. 2013. Automatic linguistic annotation of large scale L2 databases: The EF-Cambridge Open Language Database (EFCAMDAT). In Proceedings of the 31st Second Language Research Forum, pages 240–254. Cascadilla Proceedings Project, Somerville, MA.

Luke William Harding and Tim McNamara. 2017. Language assessment: The challenge of ELF. In The Routledge Handbook of English as a Lingua Franca. Routledge.

Carl James. 2005. Contrastive analysis and the language learner. Linguistics, Language Teaching and Language Learning, 120.

Eunice Eunhee Jang. 2017. Cognitive aspects of language assessment. Language Testing and Assessment, pages 163–177.

Hossein Karami. 2013. The quest for fairness in language testing. Educational Research and Evaluation, 19(2-3):158–169.

Thomas K. Landauer. 2003. Automatic essay assessment. Assessment in Education: Principles, Policy & Practice, 10(3):295–308.

Marcin Miłkowski. 2010. Developing an open-source, rule-based proofreading tool. Software: Practice and Experience, 40(7):543–566.

Daniel Naber et al. 2003. A rule-based style and grammar checker.

Carsten Roever and Tim McNamara. 2006. Language testing: The social dimension. International Journal of Applied Linguistics, 16(2):242–258.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.

Joshua Wilson and Rod D. Roscoe. 2020. Automated writing evaluation and feedback: Multiple metrics of efficacy. Journal of Educational Computing Research, 58(1):87–125.

Thomas Wolf, Julien Chaumond, Lysandre Debut, Victor Sanh, Clement Delangue, Anthony Moi, Pierric Cistac, Morgan Funtowicz, Joe Davison, Sam Shleifer, et al. 2020. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45.
Xiaoming Xi, Derrick Higgins, Klaus Zechner, and David M. Williamson. 2008. Automated scoring of spontaneous speech using SpeechRater v1.0. ETS Research Report Series, 2008(2):i–102.

Helen Yannakoudakis, Ted Briscoe, and Ben Medlock. 2011. A new dataset and method for automatically grading ESOL texts. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 180–189.

Wenxin Zhu. 2019. A study on the application of automated essay scoring in college English writing based on Pigai. In 2019 5th International Conference on Social Science and Higher Education (ICSSHE 2019), pages 451–454.