Automatic Assessment of English CEFR Levels Using BERT Embeddings

Veronica Juliana Schmalz (1,3), Alessio Brutti (1,2)
1. Free University of Bozen-Bolzano, Bolzano, Italy
2. Fondazione Bruno Kessler, Trento, Italy
3. KU Leuven, imec research group itec, Kortrijk, Belgium
veronicajuliana.schmalz@kuleuven.be, brutti@fbk.it

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract

The automatic assessment of language learners' competences represents an increasingly promising task thanks to recent developments in NLP and deep learning technologies. In this paper, we propose the use of neural models for classifying English written exams into one of the Common European Framework of Reference for Languages (CEFR) competence levels. We employ pre-trained Bidirectional Encoder Representations from Transformers (BERT) models, which provide efficient and rapid language processing on account of attention-based mechanisms and the capacity to capture long-range sequence features. In particular, we investigate augmenting the original learner's text with corrections provided by an automatic tool or by human evaluators. We consider different architectures where the texts and corrections are combined at an early stage, via concatenation before the BERT network, or as a late fusion of the BERT embeddings. The proposed approach is evaluated on two open-source datasets: the English First Cambridge Open Language Database (EFCAMDAT) and the Cambridge Learner Corpus for the First Certificate in English (CLC-FCE). The experimental results show that the proposed approach can predict the learner's competence level with remarkably high accuracy, in particular when large labelled corpora are available. In addition, we observed that augmenting the input text with corrections provides a further improvement in the automatic language assessment task.

1 Introduction

Finding a system which objectively evaluates language learners' competences is a daunting task. Several aspects need to be considered, including both subjective factors, like age, native language, cognitive capacities of the learner, and learning-related factors, for example the amount and type of received linguistic input (James, 2005; Chapelle and Voss, 2008; Jang, 2017). Indeed, language competences are not holistic, but concern different domains, so that considering the mere formal correctness of learners' language has been shown not to represent a proper assessment procedure (Roever and McNamara, 2006; Harding and McNamara, 2017; Chapelle, 2017). Moreover, human evaluators, despite having to adhere to a pre-defined scale and guidelines, such as the CEFR (Council of Europe, 2001), have proved to be biased (Karami, 2013) and inaccurate (Figueras, 2012). For these reasons, new language testing methods and tools have been developed. Current state-of-the-art models, such as Transformers, allow numerous and complex linguistic data to be processed efficiently and rapidly, by means of attention-based mechanisms and deep neural networks that capture the relevant features for the targeted task. However, the creation of, and access to, the necessary language examination resources, including annotations and metadata, remain limited to date. In this paper, we propose using a series of BERT-base models to automatically assign CEFR levels to language learners' exams.

Our aim is to examine the possibility of providing the system with previously generated corrections, produced either by humans or automatically with a language checker. Additionally, we want to analyse the impact of the amount of data on the accuracy of the model in the classification of written exams taken from the English First Cambridge Open Language Database (EFCAMDAT) (Geertzen et al., 2013) and the Cambridge Learner Corpus for the First Certificate in English (CLC-FCE) (Yannakoudakis et al., 2011). In this way, a significant step forward could be made both in improving the functioning of these automatic systems and in the future collection of data for other languages.
2 Related Works

Automatic language assessment methods concern the creation of fast, effective, unbiased and cross-linguistically valid systems that can both simplify assessment and render it objective. However, achieving such results is a complex task that researchers have been addressing for years, experimenting with several methodologies and techniques. The first tools developed mainly dealt with written texts and exploited Part-of-Speech (PoS) tagging to grade students' essays (Burstein et al., 2013), and latent semantic analysis to evaluate the content, also providing short feedback (Landauer, 2003). Advances in AI, NLP and Automatic Speech Recognition (ASR) led to the additional emergence of systems that assess spoken language skills, such as SpeechRater (Xi et al., 2008), which considers clarity of expression, pronunciation and fluency. To date, several other automatic language assessment tools are applied in the domain of large-scale testing, for example Criterion (Attali, 2004), Project Essay Grade (Wilson and Roscoe, 2020), MyAccess! (Chen and Cheng, 2008) and Pigai (Zhu, 2019). The first can detect grammatical and usage-based errors, as well as punctuation mistakes, also providing feedback; however, it requires training on the specific topics to be assessed. The second system exploits a training set of human-scored essays to score unseen texts, evaluating diction, grammar and complexity with statistical and linguistic models. Similarly, MyAccess!, calibrated with a large number of essays, can score learners' texts and measure advanced features such as syntactic and lexical complexity, content development and word choice, providing detailed feedback. Pigai, on the contrary, exploits NLP to compare the essays submitted by students with those contained in its corpora, measuring the distance between the two (Zhu, 2019). Despite the efficiency of these tools, to perform accurately they generally need large amounts of labelled and human-corrected training data. Furthermore, a standard scale is needed which can be extended across different groups of learners. In addition, powerful computational resources and, in certain cases, significant memory are required. All these elements together constitute fundamental pre-requisites which can hardly be fulfilled. For this reason, we present an approach, distinct from the previous ones, which, starting from different amounts of students' original texts, provides a classification into the different CEFR levels exploiting BERT-base models and subsidiary corrections.

3 Proposed Approach

The approach we propose for the automatic assessment of the language competences of adult English language learners is based on Transformer-type architectures performing multi-class classification. Among these, BERT-based models, characterised by efficient parallel training and the capacity to capture long-range sequence features, stand out for their size and amount of training data (Vaswani et al., 2017). Being pre-trained on large generic corpora with Masked Language Modelling (MLM) and Next Sentence Prediction (NSP) strategies, they can be conveniently employed in a wide range of tasks, including text classification, language understanding and machine translation.

The models we use for our experiments are grounded on the BERT-base-uncased architecture, part of the Hugging Face Transformers library released in 2019 (Wolf et al., 2020) and inspired by BERT (Devlin et al., 2018) from Google Research, which encodes input texts into low-dimensional embeddings. Our baseline model maps these compact representations into the CEFR levels using a network with two fully connected layers. Fig. 1(a) graphically represents the architecture. Note that this approach requires training the final classifier only: retraining or fine-tuning the BERT model would probably require very large datasets, which are not always available for this task. In order to augment the input text with corrections (either automatic or human) we investigate two possible directions. The first one (Fig. 1(b)) concatenates the two texts and applies the pre-trained BERT model; the resulting embeddings are expected to encode the information related to both texts. Conversely, the second architecture extracts individual embeddings for the original texts and the corrected ones; these are then merged and processed by the classifier, as shown in Fig. 1(c).

Figure 1: Proposed architectures for CEFR prediction. (a) Baseline: original learners' texts as input; (b) Concatenation: model taking the original learners' texts and the corrections concatenated; (c) Two-streams: model processing the original learners' texts and the corrections with separate streams.

We resort to these types of models in order to process texts efficiently while capturing long-range sequence features, thanks to parallel word processing and self-attention mechanisms. Regardless of the length of the texts, the architecture should indeed be able to accurately categorise the examinations according to the CEFR A1, A2, B1, B2 and C1 levels of competence. These, in fact, are fed to the model as labels during training, together with single contextual embeddings, or concatenated ones if corrections are included. Note that we do not provide the model with any indication about the types of errors in the original text; this information is directly extracted by the model when processing the original text together with its corrected version. A minimal implementation sketch of the three variants is given below.
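To make the three variants concrete, the following sketch shows how they can be assembled around a frozen BERT-base-uncased encoder. It is a minimal illustration assuming TensorFlow 2 (tf.keras) and the TF models of the Hugging Face transformers library; helper names such as `pooled_embedding` and `classifier_head` are ours, not taken from the paper's code, and the layer sizes, dropout rate and maximum length anticipate the values reported in Section 4.4.

```python
# Minimal sketch of the three architectures in Fig. 1 (assumptions: TensorFlow 2
# with tf.keras and the TF variants of the Hugging Face models; names illustrative).
import tensorflow as tf
from transformers import BertTokenizerFast, TFBertModel

MAX_LEN = 450      # maximum input length in tokens (512 for CLC-FCE, see Sec. 4.4)
NUM_LEVELS = 5     # CEFR levels A1, A2, B1, B2, C1

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
bert = TFBertModel.from_pretrained("bert-base-uncased")
bert.trainable = False                      # the pre-trained encoder stays frozen

def encode(texts):
    """Tokenize a list of strings into fixed-length id/mask arrays."""
    enc = tokenizer(texts, padding="max_length", truncation=True,
                    max_length=MAX_LEN, return_tensors="np")
    return enc["input_ids"], enc["attention_mask"]

def pooled_embedding(ids, mask):
    """768-dim text representation: global average pooling over BERT outputs."""
    hidden = bert(ids, attention_mask=mask).last_hidden_state
    return tf.keras.layers.GlobalAveragePooling1D()(hidden)

def classifier_head(x):
    """Two fully connected layers followed by the CEFR softmax (see Sec. 4.4)."""
    x = tf.keras.layers.Dense(768, activation="relu")(x)
    x = tf.keras.layers.Dropout(0.2)(x)
    x = tf.keras.layers.Dense(128, activation="relu")(x)
    x = tf.keras.layers.Dropout(0.2)(x)
    return tf.keras.layers.Dense(NUM_LEVELS, activation="softmax")(x)

# (a) Baseline -- and (b) concatenation, for which the learner text and its
# correction are simply joined into one string before calling `encode`.
ids_in = tf.keras.Input(shape=(MAX_LEN,), dtype=tf.int32)
mask_in = tf.keras.Input(shape=(MAX_LEN,), dtype=tf.int32)
single_stream = tf.keras.Model(
    [ids_in, mask_in], classifier_head(pooled_embedding(ids_in, mask_in)))

# (c) Two streams: separate embeddings for the original and the corrected text,
# merged (here by averaging) before the same kind of classifier.
corr_ids_in = tf.keras.Input(shape=(MAX_LEN,), dtype=tf.int32)
corr_mask_in = tf.keras.Input(shape=(MAX_LEN,), dtype=tf.int32)
merged = tf.keras.layers.Average()([pooled_embedding(ids_in, mask_in),
                                    pooled_embedding(corr_ids_in, corr_mask_in)])
two_streams = tf.keras.Model(
    [ids_in, mask_in, corr_ids_in, corr_mask_in], classifier_head(merged))
```

In the concatenation variant the original text and its correction share the same token budget, whereas the two-stream variant encodes them separately and fuses the pooled embeddings; averaging is one possible merge operation, reflecting the late fusion described above.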
4 Experimental Analysis

We evaluate the architectures described above, using both automatic and human corrections, on two English open-source datasets: EFCAMDAT and CLC-FCE. We also experiment with varying amounts of training material. The performance of the models is measured in terms of weighted classification accuracy.

4.1 EFCAMDAT Dataset

The EFCAMDAT dataset constitutes one of the largest language learner datasets currently available (Geertzen et al., 2013). The version we use contains 1,180,310 essays submitted by adult English learners of more than 172 different nationalities, covering 16 distinct levels compliant with the CEFR proficiency ones. Each essay has been corrected and evaluated by language instructors; in addition to the original texts, their corrected versions and annotated errors are also included.

We considered a sub-set of the dataset comprising 100,000 tests. Table 1 reports the distribution of the exams across the different CEFR levels, including also the average numbers of violations identified by both human evaluators and the automatic tool, normalized by the average text length. Note that the average number of errors per word decreases as the level of competence increases. Observe also that the automatic errors tend to be more numerous than the human ones, in particular for low competence levels. We use the official test partition, composed of 1,447 essays. The development set is a 20% subset of the training set.

Table 1: EFCAMDAT dataset (sample of 100,000 exams): number of exams per CEFR level, mean text length (in tokens), and mean number of manually and automatically annotated errors per word.

levels   n. exams   avg. length   manual errors/word   automatic errors/word
A1       37,290     40            4·10^-2              10·10^-2
A2       36,618     67            4·10^-2              6·10^-2
B1       18,119     92            4·10^-2              5·10^-2
B2       6,042      129           3·10^-2              4·10^-2
C1       1,732      170           2·10^-2              3·10^-2

4.2 CLC-FCE Dataset

The CLC-FCE dataset is a collection of texts produced by adult learners of English as a Second or Other Language (ESOL) for the First Certificate in English (FCE) written exam, which attests a B2 CEFR level (Yannakoudakis et al., 2011). The learners' productions, consisting of two texts, have been evaluated with a score between 0 and 5.3, and the errors have been classified into 77 classes. Following the guidelines of the authors, the average score of the two texts has been mapped to CEFR levels, as shown in Table 2. Note that only 4 levels are available in this dataset and that the labels do not uniformly match the ones present in EFCAMDAT. Table 2 also reports the distribution of the texts across the 4 classes, with the error partitions. We notice that, in this case, manual errors have been annotated in more detail and are indeed more numerous than the automatic ones. In general, the number of errors is higher than what is observed in EFCAMDAT. Also for this corpus, the average number of errors per word, both automatic and manual, decreases as the level increases. The total number of texts within the corpus is 2,469. We employed a data partition according to which 2,017 examinations constitute the training set, whereas the remaining 194 constitute the test set. In addition, 10% of the training material is used as the validation set. From the entire corpus we had to exclude 10 texts, since they were not provided with an assigned score. Despite its small size, CLC-FCE represents an important resource, given its systematic analysis of errors and the human corrections provided.

Table 2: CLC-FCE dataset: assigned scores and number of exams per CEFR level, mean text length (in tokens), and mean number of manually and automatically annotated errors per word.

scores      levels   n. exams   avg. length   manual errors/word   automatic errors/word
0.0 - 1.1   A2       10         220           16·10^-2             7·10^-2
1.2 - 2.3   B1       417        205           14·10^-2             7·10^-2
3.1 - 4.3   B2       1,414      212           9·10^-2              6·10^-2
5.1 - 5.3   C1       265        234           6·10^-2              4·10^-2
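For illustration, the score-to-level bands of Table 2 can be expressed as a small lookup helper. This is a sketch based only on the table: it assumes the two per-exam scores are averaged before mapping, and scores falling between the listed bands (e.g. 2.4-3.0) are simply left unmapped.

```python
# Sketch of the score-to-CEFR mapping of Table 2 (assumption: the two essay
# scores are averaged first; scores outside the listed bands return None).
def clc_fce_level(avg_score: float):
    bands = [
        (0.0, 1.1, "A2"),
        (1.2, 2.3, "B1"),
        (3.1, 4.3, "B2"),
        (5.1, 5.3, "C1"),
    ]
    for low, high, level in bands:
        if low <= avg_score <= high:
            return level
    return None  # score ranges between bands are not listed in Table 2

assert clc_fce_level((4.0 + 4.3) / 2) == "B2"
```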
4.3 LanguageTool

In both datasets, the content written by language learners varies according to the level of competence they were supposed to demonstrate. In addition to the human corrections provided with the data, we have generated automatic corrections using LanguageTool (Miłkowski, 2010), a language checker capable of detecting grammatical, syntactic, orthographic and stylistic errors in order to automatically correct texts of different nature and length (Naber et al., 2003). The automatic checker is based on surface text processing, does not use a deep parser and does not require a fully formalised grammar. By means of this tool, we applied the pre-defined rules for the English language to the learners' essays, generating new corrected texts for EFCAMDAT and CLC-FCE. These were used as additional input data for the experiments.
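A minimal sketch of this correction step is shown below. It relies on the `language_tool_python` wrapper, which is our assumption: the paper does not state which interface to LanguageTool was used.

```python
# Sketch of generating automatic corrections with LanguageTool; the
# `language_tool_python` wrapper is an assumed interface, chosen for illustration.
import language_tool_python

tool = language_tool_python.LanguageTool("en-US")   # pre-defined English rules

def auto_correct(text: str) -> str:
    """Apply all suggested replacements and return the corrected text."""
    return tool.correct(text)

print(auto_correct("He go to school every days."))
# e.g. "He goes to school every day." (output depends on the active rule set)
```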
4.4 Implementation Details

Our models have been implemented using Keras and Hugging Face's pre-trained BERT-base-uncased architecture (Wolf et al., 2020). The models' encoder module, consisting of a Multi-Head Attention and a Feed-Forward component, receives as input the original learners' exams, possibly together with additional human or automatic corrections. The transformed contextual embeddings are obtained by applying Global Average Pooling to the outputs of the pre-trained, frozen BERT encoder. The classifier consists of a Dense layer of 768 units with ReLU activation and a Dropout rate of 0.2, followed by another Dense layer with fewer units, 128, and the same activation function and Dropout rate (1).

Lastly, the output layer consists of a Dense layer with Softmax as activation function, and the final outputs correspond to the different CEFR levels into which the texts are classified. The selected loss is the Sparse Categorical Cross-entropy and the evaluation metric is the accuracy. The model is trained using Adam as optimizer, with learning rate 10^-5 for EFCAMDAT and 10^-4 for CLC-FCE. The batch size is 32 and the maximum input text length is set to 450 tokens for EFCAMDAT and 512 for CLC-FCE. These hyper-parameters were optimized on the related development sets.

(1) https://www.kaggle.com/akensert/bert-base-tf2-0-now-huggingface-transformer
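Put together, the training configuration above corresponds to something like the following sketch. It reuses the `single_stream` model and the `encode` helper from the sketch in Section 3 (illustrative names), and the number of epochs, which the paper does not report, is a placeholder.

```python
# Training-configuration sketch matching the hyper-parameters above; it reuses
# `single_stream` and `encode` from the earlier sketch (assumed names).
import numpy as np
import tensorflow as tf

model = single_stream
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-5),   # 1e-4 for CLC-FCE
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)

# Tiny illustrative inputs; the actual experiments use the EFCAMDAT/CLC-FCE splits.
train_texts = ["Dear Sir, I writing to complain about the service I received ..."]
train_levels = np.array([2])                 # integer CEFR index (A1=0, ..., C1=4)
ids, mask = encode(train_texts)

model.fit([ids, mask], train_levels,
          batch_size=32,
          epochs=3)                          # epoch count not reported in the paper
```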
5 Experimental Results

Table 3 reports the classification accuracy on the EFCAMDAT test set using the proposed architectures of Fig. 1. Note that, although EFCAMDAT features more than 1 million samples, we limit our analysis to 100K texts, due to memory issues and performance saturation. The results also include variations in the amount of training material, considering 10K and 50K training exams. These subsets have been obtained by sampling the training set uniformly, so that the distribution of exams per class does not change; a sketch of this subsampling is given below.
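A possible implementation of this class-preserving subsampling is sketched here; it assumes the training material is held in a pandas DataFrame with a "level" column, which is purely illustrative.

```python
# Sketch of drawing the 10K/50K subsets while keeping the per-level distribution
# unchanged (assumes a pandas DataFrame with a "level" column; names illustrative).
import pandas as pd

def uniform_subset(train_df: pd.DataFrame, n_total: int, seed: int = 0) -> pd.DataFrame:
    """Sample the same fraction from every CEFR level (stratified subsampling)."""
    frac = n_total / len(train_df)
    return (train_df.groupby("level", group_keys=False)
                    .apply(lambda g: g.sample(frac=frac, random_state=seed)))

# e.g. train_10k = uniform_subset(train_df, 10_000)
#      train_50k = uniform_subset(train_df, 50_000)
```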
Table 3: Classification accuracy on EFCAMDAT using different amounts of training data, different inputs and different architectures.

                       concatenation           two-streams
N. exams   text only   manual    automatic     manual    automatic
10K        95.2%       95.0%     95.4%         94.3%     94.4%
50K        97.1%       97.1%     97.0%         97.1%     97.0%
100K       97.4%       97.7%     97.3%         97.4%     97.2%

First of all, it is worth noting that the best approach reaches an extremely high classification accuracy (almost 98%). In addition, performance almost saturates with 50K essays, while with only 10K training samples the accuracy is well above 95%. The use of corrections, concatenated with the original text, provides some improvement over the model with original texts only. Automatic corrections seem to be more effective with less training data, while manual annotations outperform the baseline with larger training sets. The latter can, indeed, be more accurate, in particular for high proficiency levels, but their inherent variability makes the learning task more difficult. As a consequence, more training samples are needed to properly learn how to classify the input text. This is evident in Table 3, where the manual corrections are the worst for 10K samples, aligned with the baseline with 50K training samples, and the best performing when the 100K training texts are used. Finally, the two-stream approach, which averages the BERT embeddings of the two texts, seems to perform slightly worse, although by a small margin. Probably, the averaging operation is not the most suitable one in this context, as it tends to generate embedding representations which are somehow intermediate between those of the original texts and those of the corrections and, hence, less discriminative.

Table 4 reports the results obtained on the CLC-FCE corpus. With respect to EFCAMDAT, this corpus is characterized by a smaller amount of training material and by a less consistent evaluation of the input texts. These two facts lead to a clear reduction of the classification accuracy, as reported in the table. Due to the lower accuracy and the smaller size of the training set, the final performance of each model has a certain degree of variability, which depends on the model initialization and on the other random number generations in the training process. Therefore, we performed several runs varying the seed of the random number generator. The average accuracy, as well as the standard deviation, are reported in Table 4.

Table 4: Classification accuracy on CLC-FCE using different architectures and types of corrections. The two-streams model uses automatic corrections. Results are averaged over multiple runs.

model          accuracy
text only      61.5% ± 2.0
manual corr.   60.7% ± 1.8
autom. corr.   61.7% ± 1.8
two-streams    61.5% ± 1.3

Given the limited size of the training set, it is not surprising to find rather similar results across all the models. As expected, the manual corrections are the worst performing, since they would require large training sets to learn how to handle human evaluations. It is worth pointing out that the number of errors per word in CLC-FCE is much larger than in EFCAMDAT, which makes the learning task even more complex. Nevertheless, considering also the standard deviations, the models based on automatic corrections are slightly better than the model using the original texts only. The two-streams model appears extremely close to the concatenation model, but this could be related to the fact that the overall accuracy is not that high.

6 Conclusions

In this paper we presented an alternative approach for the efficient and unbiased assessment of the competences of English language learners using pre-trained BERT-base models. We structured a multi-class classification task to map the BERT embeddings of written exams from the EFCAMDAT and CLC-FCE open-source corpora to five different levels of the CEFR scale. Alongside the students' original texts and the provided manual corrections, we automatically generated additional corrected versions with LanguageTool, a multi-faceted and versatile language checker. We then conducted several experiments varying both the type and quantity of the models' input, as well as the types of models. Our results show that BERT-based architectures succeed remarkably well in classifying CEFR proficiency levels starting from the original texts, especially when large amounts of data are available. Moreover, we observed that adding automatic and manual corrections can contribute to improving the quality of the results.

References

Yigal Attali. 2004. Exploring the feedback and revision features of Criterion. Journal of Second Language Writing, 14:191–205.

Jill Burstein, Joel Tetreault, and Nitin Madnani. 2013. The e-rater® automated essay scoring system. In Handbook of Automated Essay Evaluation, pages 77–89. Routledge.

Carol A. Chapelle and Erik Voss. 2008. Utilizing technology in language assessment. Encyclopedia of Language and Education, 7:123–134.

Carol A. Chapelle. 2017. Evaluation of technology and language learning. The Handbook of Technology and Second Language Teaching and Learning, pages 378–392.

Chi-Fen Emily Chen and Wei-Yuan Eugene Cheng. 2008. Beyond the design of automated writing evaluation: Pedagogical practices and perceived learning effectiveness in EFL writing classes. Language Learning & Technology, 12(2):94–112.

Council of Europe, Council for Cultural Co-operation, Education Committee. 2001. Common European Framework of Reference for Languages: Learning, Teaching, Assessment. Cambridge University Press.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Neus Figueras. 2012. The impact of the CEFR. ELT Journal, 66(4):477–485.

Jeroen Geertzen, Theodora Alexopoulou, Anna Korhonen, et al. 2013. Automatic linguistic annotation of large scale L2 databases: The EF-Cambridge Open Language Database (EFCAMDAT). In Proceedings of the 31st Second Language Research Forum, pages 240–254. Cascadilla Proceedings Project, Somerville, MA.

Luke William Harding and Tim McNamara. 2017. Language assessment: The challenge of ELF. In The Routledge Handbook of English as a Lingua Franca. Routledge.

Carl James. 2005. Contrastive analysis and the language learner. Linguistics, Language Teaching and Language Learning, 120.

Eunice Eunhee Jang. 2017. Cognitive aspects of language assessment. Language Testing and Assessment, pages 163–177.

Hossein Karami. 2013. The quest for fairness in language testing. Educational Research and Evaluation, 19(2-3):158–169.

Thomas K. Landauer. 2003. Automatic essay assessment. Assessment in Education: Principles, Policy & Practice, 10(3):295–308.

Marcin Miłkowski. 2010. Developing an open-source, rule-based proofreading tool. Software: Practice and Experience, 40(7):543–566.

Daniel Naber et al. 2003. A rule-based style and grammar checker.

Carsten Roever and Tim McNamara. 2006. Language testing: The social dimension. International Journal of Applied Linguistics, 16(2):242–258.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.

Joshua Wilson and Rod D. Roscoe. 2020. Automated writing evaluation and feedback: Multiple metrics of efficacy. Journal of Educational Computing Research, 58(1):87–125.

Thomas Wolf, Julien Chaumond, Lysandre Debut, Victor Sanh, Clement Delangue, Anthony Moi, Pierric Cistac, Morgan Funtowicz, Joe Davison, Sam Shleifer, et al. 2020. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45.
Xiaoming Xi, Derrick Higgins, Klaus Zechner, and David M. Williamson. 2008. Automated scoring of spontaneous speech using SpeechRater v1.0. ETS Research Report Series, 2008(2):i–102.

Helen Yannakoudakis, Ted Briscoe, and Ben Medlock. 2011. A new dataset and method for automatically grading ESOL texts. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 180–189.

Wenxin Zhu. 2019. A study on the application of automated essay scoring in college English writing based on Pigai. In 2019 5th International Conference on Social Science and Higher Education (ICSSHE 2019), pages 451–454.