<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Towards automatic spoken grammatical error correction of L2 learners of English</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Stefano Bannò</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Michela Rais</string-name>
          <email>michela.rais@studenti.unitn.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marco Matassoni</string-name>
          <email>matasso@fbk.eu</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Center for Mind/Brain Sciences, University of Trento</institution>
          ,
          <addr-line>Corso Bettini 31, Rovereto (TN), 38068</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Psychology and Cognitive Science, University of Trento</institution>
          ,
          <addr-line>Corso Bettini 84, Rovereto (TN), 38068</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Fondazione Bruno Kessler</institution>
          ,
          <addr-line>via Sommarive 18, Trento, 38123</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The demand for learning English as a second language (L2) has been growing consistently over the past decades, as it has become the lingua franca of culture, entertainment, business, and academia. In this regard, mastering grammar is one of the key elements of L2 proficiency. In this paper, we illustrate an approach to spoken grammatical error correction (GEC) in a cascaded fashion using only publicly available training data. Specifically, we start from learners’ utterances, investigate disfluency detection (DD) and removal, and finally explore GEC. Despite using only publicly available data, we achieve promising results that are aligned with previous studies which leveraged a large proprietary dataset. We discuss these results and reflect on some open issues and challenges of spoken GEC.</p>
      </abstract>
      <kwd-group>
        <kwd>computer-assisted language learning</kwd>
        <kwd>spoken grammatical error correction</kwd>
        <kwd>disfluency detection</kwd>
        <kwd>L2 assessment and feedback</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>With the rise of English as the global language of culture, entertainment, business, and academia, the ability to speak it fluently has become increasingly valued and the demand for learning English as a second language (L2) has been consistently increasing over the past decades [1]. This has resulted in a growing interest in automated approaches to evaluate spoken language proficiency for applications in Computer-Assisted Language Learning (CALL), both for individual practice and classroom settings, as well as to certify proficiency in language exams.</p>
      <p>In particular, the assessment of learners’ grammar through grammatical error correction (GEC) has attracted considerable attention over the past years. While text-based GEC has become an established area of study [2, 3], spoken GEC is still a relatively new area of research, mainly due to the limited availability of specifically designed and annotated data [4]. Assessing spoken grammar requires several adjustments to standard GEC models, as these tend not to generalize to speech. Spoken GEC (see Table 2) is in fact more challenging than written GEC (see Table 1), as spoken grammar tends to be more flexible and less encoded than written grammar [5]. L2 spoken grammar is often characterized by disfluencies, which occur naturally in speech, and its errors differ from the ones made by L2 learners in written texts. As a result, spoken GEC is typically approached in a cascaded fashion: an automatic speech recognition (ASR) module is used to transcribe the spoken text. This is followed by a disfluency detection (DD) and removal step and, finally, by GEC.</p>
      <p>Ital-IA 2023: 3rd National Conference on Artificial Intelligence, organized by CINI, May 29–31, 2023, Pisa, Italy. ∗Corresponding author. ORCID: 0000-0002-2799-0601 (S. Bannò); 0009-0006-5873-8894 (M. Rais).</p>
    </sec>
    <sec id="sec-2">
      <title>2. Data</title>
      <p>We employed transformer-based models both for DD and spoken GEC. In this section, we describe the data used for training our models, which we tested on the TLT-GEC, a subset of the TLT corpus, a small proprietary corpus of young Italian learners of English presented in [7]. For the DD module training we employed two corpora, the NICT-JLE and the KIT Speaking Test Corpus.</p>
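      <p>The cascaded treatment of spoken GEC described above (ASR, then disfluency detection and removal, then GEC) can be illustrated as a simple pipeline. The three stages below are toy stubs standing in for the actual ASR, BERT-based DD, and T5 GEC models; the filler list and correction table are invented for illustration only.</p>

```python
from typing import List

# Toy stubs for the three cascade stages (ASR -> DD and removal -> GEC).
# Real systems would call an ASR model, a BERT-based disfluency tagger,
# and a T5 GEC model; names and rules here are illustrative only.

def asr_stub(audio: str) -> List[str]:
    # Stand-in for ASR: treat the input string as its own transcription.
    return audio.lower().split()

def remove_disfluencies_stub(words: List[str]) -> List[str]:
    # Toy rule: drop filled pauses and partial words ending in "-".
    fillers = {"uhm", "uh", "er"}
    return [w for w in words if w not in fillers and not w.endswith("-")]

def gec_stub(words: List[str]) -> str:
    # Toy lookup table standing in for a seq2seq GEC model.
    corrections = {"catched": "caught"}
    return " ".join(corrections.get(w, w) for w in words)

def spoken_gec_cascade(audio: str) -> str:
    # ASR, then disfluency removal, then GEC, as in the cascaded setup.
    return gec_stub(remove_disfluencies_stub(asr_stub(audio)))
```

      <p>One practical appeal of the cascaded design is that swapping a stub for a real model changes only one stage at a time.</p>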
      <sec id="sec-2-1">
        <title>2.1. NICT-JLE</title>
        <p>The National Institute of Information and
Communications Technology - Japanese Learner English (NICT-JLE)
corpus, originally introduced in [12], is a collection of
manual transcriptions of approximately 300 hours of oral
interviews of Japanese learners of English which does not
include the original audio recordings.1 A subset of the
corpus was manually annotated with disfluencies as well
as grammatical errors which were corrected.
Furthermore, this subset includes annotations about proficiency
scores ranging from A1 to B2 of the Common European
Framework of Reference (CEFR) [13].</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. KIT Speaking Test Corpus</title>
        <p>The Kyoto Institute of Technology (KIT) Speaking Test Corpus, released for public use by [14], consists of manual transcriptions of approximately 4,448 hours of interviews of 574 Japanese undergraduate students.2 As in the case of NICT-JLE, the corpus does not include the original audio recordings. The manual annotations follow the tagging system employed in the NICT-JLE corpus; however, they only include disfluencies, whereas grammatical errors are not annotated. The proficiency level of the students approximately ranges from CEFR level A1 to B2.</p>
      </sec>
      <sec id="sec-2-3">
        <title>2.3. EFCAMDAT</title>
        <p>EFCAMDAT [8, 9, 10] is one of the largest publicly available L2 learner corpora and consists of 1,180,310 scripts written by 174,743 L2 learners.3 The scripts are annotated with POS tags and information on grammatical dependencies, and are partially error-tagged by human experts. After excluding noisy responses and incorrect annotations, we kept 762,475 responses, from which we removed punctuation and capitalisation in order to make them more similar to speech transcriptions. We used spaCy4 to extract pairs of parallel sentences (i.e., original versus corrected), from which we removed sentences shorter than 4 words as well as those containing broken XML tags and manual annotations on word limit. Following [15], we further excluded parallel sentences where the token edit distance is higher than 60% of the length of the original sentence, in order to guarantee consistency between the original sentences and their corrected counterparts.</p>
      </sec>
      <sec id="sec-2-4">
        <title>2.4. BEA-2019</title>
        <p>The corpora from the BEA 2019 shared task [11] are text-based corpora tagged with GEC annotations.5</p>
        <p>CLC-FCE: the Cambridge Learner Corpus - First Certificate in English (CLC-FCE) [16] is a publicly available section of the larger proprietary Cambridge Learner Corpus (CLC) [17], consisting of 1,244 FCE exam scripts.6</p>
        <p>Write &amp; Improve: a dataset derived from Write &amp; Improve with Cambridge, an online platform where L2 learners of English can practise their writing skills [18].7</p>
        <p>LOCNESS: a section of the Louvain Corpus of Native English Essays (LOCNESS), consisting of 100 essays written by L1 English undergraduates from the United Kingdom and the United States [19].</p>
        <p>Lang-8: the Lang-8 Corpus of Learner English is a dataset extracted from the Lang-8 website,8 whose users are encouraged to correct each other’s grammar [20, 21].</p>
        <p>NUCLE: the National University of Singapore Corpus of Learner English (NUCLE) is a collection of 1,400 essays written by Asian undergraduate students at the National University of Singapore [22].</p>
        <p>Including EFCAMDAT, the data used for training the spoken GEC system amount to 2,552,825 sentences, which we randomly split into a training set of 2,527,296 sentences and a development set of 25,529 sentences.</p>
        <p>As a benchmark for assessing the performance of the spoken GEC system, we employed the same test set of the CLC-FCE corpus used in previous studies [23, 4], with punctuation and capitalisation removed.</p>
      </sec>
      <sec id="sec-2-5">
        <title>2.5. TLT-GEC</title>
        <p>The TLT-GEC is a small proprietary dataset of speech utterances of young Italian learners of English, which we have manually annotated with disfluencies and two sets of grammatical error corrections performed by two different human annotators. The dataset is derived from the larger TLT-school corpus presented by [7] and contains 1,127 sentences for a total of 4.96 hours. The CEFR proficiency levels of the speakers are approximately A2 and B1. The data was split into two sets, a development set of 605 sentences and a test set of 522 sentences, with non-overlapping speakers. The ASR transcriptions were obtained through a Conformer model made available by NVIDIA in the popular NeMo toolkit.9 The Conformer architecture [24] effectively combines self-attention layers and convolution blocks to learn global and local correlations simultaneously; this variant uses a decoder based on CTC loss instead of a standard RNN-T/Transducer, substituting the auto-regressive LSTM component with a simpler linear decoder. The word error rate (WER) is 24.72% considering both development and test sets.</p>
        <p>1 alaginrc.nict.go.jp/nict_jle/index_E.html#license; 2 kitstcorpus.jp/; 3 philarion.mml.cam.ac.uk/; 4 spacy.io; 5 cl.cam.ac.uk/research/nl/bea2019st/#data; 6 ilexir.co.uk/datasets/index.html; 7 writeandimprove.com/; 8 lang-8.com/</p>
      </sec>
    </sec>
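      <p>The EFCAMDAT filtering steps described above (lowercasing, punctuation removal, dropping sentences shorter than 4 words, and excluding pairs whose token edit distance exceeds 60% of the original length) can be sketched as follows. The function names are ours, and the paper used spaCy rather than the whitespace tokenisation shown here.</p>

```python
import string

def normalise(sentence: str) -> list:
    # Lowercase and strip punctuation to resemble speech transcriptions.
    table = str.maketrans("", "", string.punctuation)
    return sentence.lower().translate(table).split()

def token_edit_distance(a: list, b: list) -> int:
    # Standard Levenshtein distance computed over tokens.
    prev = list(range(len(b) + 1))
    for i, ta in enumerate(a, 1):
        cur = [i]
        for j, tb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ta != tb)))
        prev = cur
    return prev[len(b)]

def keep_pair(original: str, corrected: str, max_ratio: float = 0.6) -> bool:
    src, tgt = normalise(original), normalise(corrected)
    if len(src) < 4:  # drop sentences shorter than 4 words
        return False
    # exclude pairs whose edit distance exceeds 60% of the source length
    return token_edit_distance(src, tgt) <= max_ratio * len(src)
```

      <p>The edit-distance threshold is what guarantees that an "original/corrected" pair really is a correction rather than a rewrite.</p>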
    <sec id="sec-7">
      <title>3. Disfluency detection</title>
      <p>We performed DD as a sequence tagging task using a BERT-based [25] token classifier:</p>
      <p>d1:M = BERT(w1:M),   P(rm | w1:M) = softmax(W dm),</p>
      <p>where rm is a binary tag which indicates whether word wm is fluent or disfluent. Subsequently, all words classified as disfluencies are removed from the transcriptions. Table 3 considers the example previously shown in Table 2 and clarifies each passage once again.</p>
      <p>Specifically, the BERT-based model consists of a BERT layer in the version provided by the HuggingFace Transformers library [26] (bert-base-uncased), a dropout layer, a dense layer of 768 nodes, a dropout layer, another dense layer of 128 nodes, and finally the output layer. The model is trained on the NICT-JLE and KIT Speaking Test corpora and uses an Adam optimiser [27] with batch size 64, learning rate 1e-06, dropout rate 0.2, and negative log likelihood as loss.</p>
      <p>For evaluation, we use precision, recall, and F1 scores. Table 4 shows the results of the DD model on the test and development sets of TLT-GEC in terms of precision, recall and F1 score.</p>
      <p>9 https://catalog.ngc.nvidia.com/orgs/nvidia/teams/nemo/models/stt_en_conformer_ctc_large</p>
    </sec>
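      <p>Given per-token binary disfluency tags of the kind such a classifier produces, the removal step and the precision/recall/F1 evaluation reduce to a few lines. This is an illustrative sketch with hypothetical helper names, not the paper's implementation.</p>

```python
from typing import List, Tuple

def remove_tagged_disfluencies(words: List[str], tags: List[int]) -> List[str]:
    # tags[m] == 1 marks word m as disfluent; only fluent words are kept.
    return [w for w, t in zip(words, tags) if t == 0]

def precision_recall_f1(gold: List[int], pred: List[int]) -> Tuple[float, float, float]:
    # Scores computed over the positive (disfluent) class.
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```
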
    <sec id="sec-9">
      <title>4. GEC</title>
      <p>For the GEC model, we used a T5 model [28] initialised from the version provided by the HuggingFace Transformers library [26] (t5-base) and trained on EFCAMDAT and BEA-2019, with the exclusion of the CLC-FCE test set, which we used to compare the results on TLT-GEC. We set the maximum sequence length to 64, using an AdamW optimiser [29] with learning rate 1e-5 and batch size 32.</p>
      <p>To evaluate the performance of our model, we use two common metrics for GEC, i.e., the MaxMatch (M2) score [30] and the General Language Evaluation Understanding (GLEU) metric [31]. The former computes the F-score of edits over the optimal phrasal alignment between the hypothesis and the reference sentences, whereas the latter is inspired by BLEU [32] and captures grammatical corrections as well as fluency rewrites.</p>
      <p>In Table 5, we report the results of the spoken GEC system on the TLT-GEC test set in terms of M2 and GLEU. For further comparison, we also report the results of our model on the CLC-FCE test set and compare them to the results of the GEC model described in [4]. We also report the agreement between the two human annotators.</p>
      <p>Considering the performance on the CLC-FCE test set, it can be observed that our proposed model performs moderately better than the model from [4]. These results are quite remarkable, given that we used only publicly available data, whereas [4] employed the entire CLC corpus in addition to the BEA-2019 data.</p>
      <p>For completeness, we report the results on TLT-school considering the performance of the GEC model on the manual transcriptions with disfluencies (dsf), with disfluencies manually removed (flt), and with disfluencies automatically removed (autoflt). As expected, there is a remarkable improvement both in terms of GLEU and M2 when disfluencies are removed from the transcriptions. Finally, we report the performance of our GEC system on ASR transcriptions. It can be observed that also in this case removing disfluencies improves the performance for both metrics. It is also noticeable that the performance on the ASR transcriptions (autoflt) is slightly better than the one on manual transcriptions (dsf) in terms of GLEU.</p>
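      <p>To give an intuition for what GLEU rewards, the toy score below counts hypothesis n-grams matching the reference and penalises n-grams shared only with the uncorrected source, i.e., errors left in place. It is a deliberately simplified illustration with invented function names, not the official GLEU metric of [31].</p>

```python
from collections import Counter

def ngrams(tokens, n):
    # Multiset of the n-grams of a token sequence.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def simplified_gleu(source, hypothesis, reference, max_n=2):
    # Reward hypothesis n-grams found in the reference; penalise those
    # that match only the uncorrected source (errors left uncorrected).
    src, hyp, ref = source.split(), hypothesis.split(), reference.split()
    matched, penalty, total = 0, 0, 0
    for n in range(1, max_n + 1):
        h, r, s = ngrams(hyp, n), ngrams(ref, n), ngrams(src, n)
        total += sum(h.values())
        matched += sum((h & r).values())
        penalty += sum(((h & s) - r).values())
    return max(matched - penalty, 0) / total if total else 0.0
```

      <p>Leaving an error in place scores worse than correcting it, which is the property that makes GLEU-style metrics suitable for fluency-oriented GEC.</p>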
      <p>Example from Tables 2 and 3 (Disfluent / Fluent / Corrected versions); disfluent form: he see the thief is catched by policeman the last night.</p>
    </sec>
    <sec id="sec-10">
      <title>5. Conclusions and future works</title>
      <p>In this paper, we explored an approach to automatic spoken grammatical error correction of Italian learners of English using only publicly available training data.</p>
      <p>First, we investigated DD. Our DD module achieved a good performance in terms of precision, recall and F1 score on both the development and test sets of the TLT-GEC.</p>
      <p>The second module of our cascaded framework is a spoken GEC system which achieves results aligned with previous studies. As we expected, we found that disfluency removal has a positive impact on GEC on both manual and ASR transcriptions of the TLT-GEC. Furthermore, we observed that the fully automated system (i.e., ASR+DD+GEC) achieves higher results than the system including manual transcriptions with disfluencies in terms of GLEU.</p>
      <p>Although we identified disfluencies as problematic elements for spoken GEC and we investigated an efficient way to detect and remove them, we acknowledge that there are still several open problems, which are particularly evident in the TLT-GEC data. Specifically, the presence of code-switched words is a challenging issue, as can be seen in the following example drawn from the data (manual transcriptions):10</p>
      <p>hello my name is giovanni uhm and i’m from trento and i live in rovereto uhm rovereto is in nord italien uhm uhm and uhm hobby uhm f- f- my favourite hobby uhm is uhm football and and koch</p>
      <p>As can be observed, not only does the answer feature Italian names and toponyms, but it also contains German code-switched words. The output of the GEC system after automatically removing the disfluencies is the following:</p>
      <p>hello my name is giovanni and i’m from trento and i live in trento it is in north italien my favourite hobby is football and cooking</p>
      <p>It appears to handle the code-switched words nord and koch quite efficiently, but it fails to correct italien.11 Therefore, future works will attempt to address the problem of named entity recognition and code-switching in the framework of spoken GEC.</p>
      <p>Another interesting problem concerns the relevance of learners’ answers to the question prompts. For example, one of the question prompts is: What country would you like to visit in the future? Why? A sample answer drawn from the data is the following: i like to visit turkey because i like speaking the language [...] Although the answer is grammatically correct if considered individually, it does, in fact, contain a verbal error in relation to the question prompt. We also plan to address this issue, starting from concatenating the question prompt with the learner’s answer.</p>
      <p>Finally, we plan to investigate an SSL-based approach (e.g., using wav2vec 2.0 [33] or more recent models such as HuBERT [34] or WavLM [35]) to spoken GEC. Specifically, it would be interesting to generate synthetic audio data using a text-to-speech system on the written learner corpora we used in this paper for training our models.</p>
      <p>10 We only changed the first name and one toponym due to privacy reasons, but the example is still valid. 11 In fact, it also does not correct the agreement error hobby is football and cooking, which should feature hobbies are instead of hobby is.</p>
    </sec>
    <sec id="sec-11">
      <title>References</title>
      <p>[1] P. Howson, The English effect, British Council, London, 2013.</p>
      <p>[2] Y. Wang, Y. Wang, K. Dang, J. Liu, Z. Liu, A comprehensive survey of grammatical error correction, ACM Transactions on Intelligent Systems and Technology (TIST) 12 (2021) 1–51. doi:10.1145/3474840.</p>
      <p>[3] C. Bryant, Z. Yuan, M. R. Qorib, H. Cao, H. T. Ng, T. Briscoe, Grammatical error correction: A survey of the state of the art, arXiv preprint arXiv:2211.05166 (2022). doi:10.48550/arXiv.2211.05166.</p>
      <p>[4] Y. Lu, S. Bannò, M. J. F. Gales, On assessing and developing spoken ’grammatical error correction’ systems, in: Proceedings of the 17th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2022), Association for Computational Linguistics, Seattle, Washington, 2022, pp. 51–60. doi:10.18653/v1/2022.bea-1.9.</p>
      <p>[5] M. McCarthy, R. Carter, Ten criteria for a spoken grammar, in: Explorations in corpus linguistics, Cambridge University Press, 2006, pp. 27–52.</p>
      <p>[6] S. Bannò, M. Matassoni, Proficiency assessment of L2 spoken English using wav2vec 2.0, in: 2022 IEEE Spoken Language Technology Workshop (SLT), 2023, pp. 1088–1095. doi:10.1109/SLT54892.2023.10023019.</p>
      <p>[7] R. Gretter, M. Matassoni, S. Bannò, D. Falavigna, TLT-school: a corpus of non native children speech, in: Proceedings of the Twelfth Language Resources and Evaluation Conference, European Language Resources Association, Marseille, France, 2020, pp. 378–385. URL: https://aclanthology.org/2020.lrec-1.47.</p>
      <p>[8] J. Geertzen, T. Alexopoulou, A. Korhonen, Automatic linguistic annotation of large scale L2 databases: The EF-Cambridge Open Language Database (EFCAMDAT), in: Proceedings of the 31st Second Language Research Forum, Cascadilla Proceedings Project, Somerville, 2013, pp. 240–254. URL: http://www.lingref.com/cpp/slrf/2012/paper3100.pdf.</p>
      <p>[9] Y. Huang, J. Geertzen, R. Baker, A. Korhonen, T. Alexopoulou, The EF Cambridge Open Language Database (EFCAMDAT): Information for users, 2017.</p>
      <p>[10] Y. Huang, A. Murakami, T. Alexopoulou, A. Korhonen, Dependency parsing of learner English, International Journal of Corpus Linguistics 23 (2018) 28–54. doi:10.1075/ijcl.16080.hua.</p>
      <p>[11] C. Bryant, M. Felice, Ø. E. Andersen, T. Briscoe, The BEA-2019 shared task on grammatical error correction, in: Proceedings of the Fourteenth Workshop on Innovative Use of NLP for Building Educational Applications, Association for Computational Linguistics, Florence, Italy, 2019, pp. 52–75. doi:10.18653/v1/W19-4406.</p>
      <p>[12] E. Izumi, K. Uchimoto, H. Isahara, The NICT JLE corpus: Exploiting the language learners’ speech database for research and education, International journal of the computer, the internet and management 12 (2004) 119–125.</p>
      <p>[13] Council of Europe, Common European Framework of Reference for Languages: Learning, Teaching, Assessment, Cambridge University Press, Cambridge, 2001. URL: https://rm.coe.int/1680459f97.</p>
      <p>[14] K. Kanzawa, H. Mitsunaga, G. Edmonds, Y. Hato, Y. Tsubota, M. Mori, Y. Shimizu, Development and administration of a Skype-based English speaking test in a Japanese high school, Bulletin of Kyoto Institute of Technology 14 (2022) 27–47.</p>
      <p>[15] Y.-C. Lo, J.-J. Chen, C. Yang, J. Chang, Cool English: a grammatical error correction system based on large learner corpora, in: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, Association for Computational Linguistics, Santa Fe, New Mexico, 2018, pp. 82–85. URL: https://aclanthology.org/C18-2018.</p>
      <p>[16] H. Yannakoudakis, T. Briscoe, B. Medlock, A new dataset and method for automatically grading ESOL texts, in: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics, Portland, Oregon, USA, 2011, pp. 180–189. URL: https://aclanthology.org/P11-1019.</p>
      <p>[17] D. Nicholls, The Cambridge Learner Corpus: Error coding and analysis for lexicography and ELT, in: Proceedings of the Corpus Linguistics 2003 Conference, 2003, pp. 572–581.</p>
      <p>[18] H. Yannakoudakis, Ø. E. Andersen, A. Geranpayeh, T. Briscoe, D. Nicholls, Developing an automated writing placement system for ESL learners, Applied Measurement in Education 31 (2018) 251–267. doi:10.1080/08957347.2018.1464447.</p>
      <p>[19] S. Granger, The computer learner corpus: a versatile new source of data for SLA research, in: S. Granger (Ed.), Learner English on computer, Routledge, London, 1998, pp. 3–18. doi:10.4324/9781315841342.</p>
      <p>[20] T. Mizumoto, Y. Hayashibe, M. Komachi, M. Nagata, Y. Matsumoto, The effect of learner corpus size in grammatical error correction of ESL writings, in: Proceedings of COLING 2012: Posters, The COLING 2012 Organizing Committee, Mumbai, India, 2012, pp. 863–872. URL: https://aclanthology.org/C12-2084.</p>
      <p>[21] T. Tajiri, M. Komachi, Y. Matsumoto, Tense and aspect error correction for ESL learners using global context, in: Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Association for Computational Linguistics, Jeju Island, Korea, 2012, pp. 198–202. URL: https://aclanthology.org/P12-2039.</p>
      <p>[22] D. Dahlmeier, H. T. Ng, S. M. Wu, Building a large annotated corpus of learner English: The NUS corpus of learner English, in: Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications, 2013, pp. 22–31.</p>
      <p>[23] Y. Fathullah, M. Gales, A. Malinin, Ensemble distillation approaches for grammatical error correction, in: ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2021, pp. 2745–2749. doi:10.1109/ICASSP39728.2021.9413385.</p>
      <p>[24] A. Gulati, J. Qin, C.-C. Chiu, N. Parmar, Y. Zhang, J. Yu, W. Han, S. Wang, Z. Zhang, Y. Wu, et al., Conformer: Convolution-augmented transformer for speech recognition, arXiv preprint arXiv:2005.08100 (2020).</p>
      <p>[25] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, arXiv e-prints (2018) arXiv:1810.04805. doi:10.48550/arXiv.1810.04805.</p>
      <p>[26] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, J. Davison, S. Shleifer, P. von Platen, C. Ma, Y. Jernite, J. Plu, C. Xu, T. Le Scao, S. Gugger, M. Drame, Q. Lhoest, A. Rush, Transformers: State-of-the-art natural language processing, in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Association for Computational Linguistics, Online, 2020, pp. 38–45. doi:10.18653/v1/2020.emnlp-demos.6.</p>
      <p>[27] D. Kingma, J. Ba, Adam: a method for stochastic optimization, in: International Conference on Learning Representations, 2014.</p>
      <p>[28] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, P. J. Liu, et al., Exploring the limits of transfer learning with a unified text-to-text transformer, Journal of Machine Learning Research 21 (2020) 1–67.</p>
      <p>[29] I. Loshchilov, F. Hutter, Decoupled weight decay regularization, in: International Conference on Learning Representations 2019, 2019.</p>
      <p>[30] D. Dahlmeier, H. T. Ng, Better evaluation for grammatical error correction, in: Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics, Montréal, Canada, 2012, pp. 568–572. URL: https://aclanthology.org/N12-1067.</p>
      <p>[31] C. Napoles, K. Sakaguchi, M. Post, J. Tetreault, Ground truth for grammatical error correction metrics, in: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), Association for Computational Linguistics, Beijing, China, 2015, pp. 588–593. doi:10.3115/v1/P15-2097.</p>
      <p>[32] K. Papineni, S. Roukos, T. Ward, W.-J. Zhu, Bleu: a method for automatic evaluation of machine translation, in: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, 2002, pp. 311–318.</p>
      <p>[33] A. Baevski, H. Zhou, A. Mohamed, M. Auli, wav2vec 2.0: A framework for self-supervised learning of speech representations, in: NeurIPS 2020, 2020, pp. 1–12.</p>
      <p>[34] W.-N. Hsu, B. Bolte, Y.-H. H. Tsai, K. Lakhotia, R. Salakhutdinov, A. Mohamed, HuBERT: Self-supervised speech representation learning by masked prediction of hidden units, IEEE/ACM Transactions on Audio, Speech, and Language Processing 29 (2021) 3451–3460.</p>
      <p>[35] S. Chen, C. Wang, Z. Chen, Y. Wu, S. Liu, Z. Chen, J. Li, N. Kanda, T. Yoshioka, X. Xiao, et al., WavLM: Large-scale self-supervised pre-training for full stack speech processing, IEEE Journal of Selected Topics in Signal Processing (2022). doi:10.1109/JSTSP.2022.3188113.</p>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>