Towards SMT-Assisted Error Annotation of Learner Corpora

Nadezda Okinina                      Lionel Nicolas
Eurac Research                       Eurac Research
viale Druso 1, Bolzano, Italy        viale Druso 1, Bolzano, Italy
nadezda.okinina@eurac.edu            lionel.nicolas@eurac.edu

Abstract

English. We present the results of prototypical experiments conducted with the goal of designing a machine translation (MT) based system that assists the annotators of learner corpora in performing orthographic error annotation. When an annotator marks a span of text as erroneous, the system suggests a correction for the marked error. The presented experiments rely on word-level and character-level Statistical Machine Translation (SMT) systems.

Italian. We present the results of prototypical experiments conducted with the aim of creating a machine translation (MT) based system that assists the annotators of language learner corpora during the process of orthographic error annotation. When an annotator marks a segment of text as erroneous, the system suggests a correction for the marked error. The presented experiments use statistical machine translation (SMT) systems at the word and character levels.

1 Introduction

Manual error annotation of learner corpora is a time-consuming process which is often a bottleneck in learner corpus research. "Computer learner corpora are electronic collections of authentic FL/SL textual data assembled according to explicit design criteria for a particular SLA/FLT purpose. They are encoded in a standardised and homogeneous way and documented as to their origin and provenance" (Granger, 2002). (FL: foreign language, SL: second language, SLA: second language acquisition, FLT: foreign language teaching.)

Error-annotated learner corpora serve the needs of language acquisition studies and pedagogy development, and support the creation of natural language processing tools such as automatic language proficiency level checking systems (Hasan et al., 2008) or automatic error detection and correction systems (see Section 2). In this paper we present our first attempts at creating a system that assists annotators in performing orthographic error annotation by suggesting a correction for specific spans of text selected and marked as erroneous by the annotators. In the prototypical experiments, the suggestions are generated by word-level and character-level SMT systems.

This paper is organized as follows: we review existing approaches to automatic error correction (Section 2), introduce our experiments (Section 3), present the data we used (Section 4), describe and discuss the performed experiments (Section 5) and conclude the paper (Section 6).

2 Related Work

Orthographic errors are mistakes in spelling, hyphenation, capitalisation and word breaks (Abel et al., 2016). Automatic orthographic error correction can benefit from methods recently developed for grammatical error correction (GEC), such as methods relying on SMT and Neural Machine Translation (NMT) (Chollampatt et al., 2017; Ji et al., 2017; Junczys-Dowmunt et al., 2016; Napoles et al., 2017; Sakaguchi et al., 2017; Schmaltz et al., 2017; Yuan et al., 2016; etc.). These approaches treat error correction as an MT task from incorrect to correct language. In the case of orthographic error correction these "languages" are extremely close, which greatly facilitates the MT task. In that respect, error correction is similar to the task of translating closely related languages such as, for example, Macedonian and Bulgarian (Nakov et al., 2012). In our experiments, we rely on the implementation of SMT models provided by the Moses toolkit (Koehn et al., 2007).

SMT and NMT can be easily adapted to new languages, but their performance depends on the amount and quality of the training data. To make up for the lack of parallel corpora of texts containing language errors and their correct equivalents, various techniques for resource construction have been suggested, such as using the World Wide Web as a corpus (Whitelaw et al., 2009), parsing corrective Wikipedia edits (Grundkiewicz et al., 2014) or injecting errors into error-free text (Ehsan et al., 2013). For our prototypical experiments, we deliberately limit ourselves to the manually curated high-quality data at our disposal and use existing German error-annotated corpora as training data.

In recent years, learner corpora of German have been used for the creation of systems for the automatic correction of German children's spelling errors (Stüker et al., 2011; Laarmann-Quante, 2017), but no work has been done on automatic orthographic error correction of adult learner texts.

3 Objectives of the Experiments

The particularity of our work is that we focus on a specific use case where annotators are assisted in error-tagging newly created learner corpora. To ensure the relevance of our system and limit false positives that would hinder its adoption, the targeted use case is to only suggest corrections while leaving the task of selecting the error to the linguist. The aforementioned GEC systems take as input text containing language errors and produce corrected text. Thus, they may introduce changes in any part of the text, even where no errors are observed. In order to prevent such behavior, we only submit to our system spans of text marked as erroneous by annotators, while leaving out spans of text not containing errors. Therefore, our system is not directly comparable to existing GEC systems.

A given language error may have more than one possible correction, but in the presented research we limit ourselves to orthographic errors, which in most cases have only one correction (Nerius et al., 2007). Our system is meant to be used for the creation of new learner corpora at the Institute for Applied Linguistics, where learner corpora of German, Italian and English are created and studied (Abel et al., 2013; Abel et al., 2015; Abel et al., 2016; Abel et al., 2017; Zanasi et al., 2018). Preliminary experiments with the freely available vocabulary-based spell checking tool Hunspell (http://hunspell.github.io/) yielded unsatisfactory results (see Section 5.1) and incited us to try SMT in order to train an error-correction system and tune it to the specific nature of our data. We thus performed a series of experiments to obtain a preliminary evaluation of the range of performances of different n-gram models when trained on small-scale data (Section 5.1), studied the impact of the similarity between training data and test data to understand which datasets are the most suitable to train our models on (Sections 5.2 and 5.3), and finally made preliminary attempts to improve the performance by optimising the usage of the SMT systems (Section 5.4).

As our systems are not directly comparable to GEC systems, the usual metrics used to evaluate GEC systems are not fully adequate, because they target a similar but different use case. We thus evaluate our systems according to their accuracy, which we define as the ratio between the number of suggestions matching the target hypothesis present in the test data (TH) and the whole number of annotated errors. (The TH corresponds to a correction associated with each error; Reznicek et al., 2013.) However, accuracy is not the only criterion, as it is also important not to disturb the annotators with irrelevant suggestions: it is better not to suggest any TH than to suggest a wrong one. In order to control the ratio between right and wrong suggestions, we also evaluate our systems according to their precision. We define precision as the ratio between the number of suggestions matching the TH and the whole number of suggestions, correct and incorrect, thus excluding the errors for which the system was consulted but no correction was suggested. Precision is mainly used as a quality threshold which should remain high, whereas our main performance measure is accuracy.
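The two measures defined above can be stated concretely. The following sketch is our own illustration (the function and the data layout are hypothetical, not part of the described system): it computes accuracy and precision from a list of annotated errors, each paired with the system's suggestion, where a missing suggestion is represented as None.

```python
def evaluate(errors):
    """Compute (accuracy, precision) as defined in Section 3.

    errors: list of (target_hypothesis, suggestion) pairs;
    suggestion is None when the system returned no correction."""
    total = len(errors)  # all annotated errors the system was consulted for
    suggested = [(th, s) for th, s in errors if s is not None]
    matching = sum(1 for th, s in suggested if s == th)  # suggestions equal to the TH
    accuracy = matching / total if total else 0.0            # matches / all errors
    precision = matching / len(suggested) if suggested else 0.0  # matches / suggestions made
    return accuracy, precision

# Four annotated errors: two correct suggestions, one wrong one, one no-suggestion.
acc, prec = evaluate([("Sommerfest", "Sommerfest"),
                      ("Fahrrad", "Fahrrad"),
                      ("Hause", "Haus"),
                      ("morgen", None)])
# accuracy = 2/4 = 0.5, precision = 2/3
```

Note that the unanswered error lowers accuracy but not precision, which is exactly the asymmetry motivating the use of precision as a quality threshold.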
4 Corpora Used

Our experiments rely on three error-annotated learner corpora: KoKo, Falko and MERLIN.

KoKo is a corpus of 1.503 argumentative essays (811.330 tokens) of written German L1 (first language) from high school pupils, 83% of whom are native speakers of German (Abel et al., 2016). It relies on a very precise error annotation scheme with 29 types of orthographic errors.

The Falko corpus consists of six subcorpora (Reznicek et al., 2012), out of which we are using the subcorpus of 107 error-annotated written texts by advanced learners of L2 (second language) German (122.791 tokens).

The MERLIN corpus was compiled from standardized tests of L2 German, Italian and Czech related to the CEFR (Common European Framework of Reference for Languages) (Boyd et al., 2014).
We are using the German part of MERLIN, which contains 1033 learner texts (154.335 tokens): a little more than 200 texts for each of the covered CEFR levels (A1, A2, B1, B2 and C1).

Due to the differences in content and format, we do not use all three learner corpora in all the experiments. KoKo is our main corpus because of its larger size, easy-to-use format and detailed orthographic error annotation. We use it in the training, validation and testing of our SMT systems. Falko is smaller and its format does not allow an easy alignment of orthographic errors; we thus only use it in some experiments as part of the training corpus (Sections 5.1 and 5.2). MERLIN was annotated similarly to KoKo, therefore error-correction results obtained for these two corpora are easily comparable. Furthermore, MERLIN is representative of different levels of language mastery. We thus use it for testing some of our systems (Section 5.2).

As the language model for our character-based SMT systems cannot be generated from the limited amount of data provided by learner corpora, for that purpose we used 3.000.000 sentences of a German news subcorpus from the Leipzig Corpora Collection (http://hdl.handle.net/11022/0000-0000-2417-E).

5 Prototypical Experiments

5.1 Testing Different N-Gram Models

We started by testing SMT word-based and character-based language models with various n-gram orders in order to understand which one suffers least from data scarcity and thus best suits our data (Table 1). (The computational results presented have been achieved in part using the Vienna Scientific Cluster, VSC.) We used Moses default values for all the other parameters. The systems were trained on a parallel corpus composed of learner texts and their corrected versions from Falko and KoKo. In each fold of the 10-fold validation, 1/10 of KoKo is taken out of the training corpus and used as a validation corpus.

Since our objective was only to observe the overall adequateness of the SMT models, we only attempted to optimise the way the SMT models were used at a later stage (see Section 5.4). These prototypical experiments showed that all the SMT models have a rather high precision and that, for this amount of training data, the best-performing SMT model is the word 5-gram model. It yielded an encouraging result of 39% accuracy and 89% precision, which is far better than the 11% accuracy and 8% precision originally obtained with Hunspell. However, the 39% accuracy was obtained by training on Falko and 9/10 of KoKo and validating on 1/10 of KoKo, which would be the configuration we would have towards the end of the annotation of a new learner corpus. We thus proceeded with our experiments by testing how the SMT models would perform at an earlier stage.

              word-grams                  character-grams
          1      3      5      10       6      10      15
Prec.   84%    87%    89%    84%      83%    86%    87%
Acc.    32%    37%    39%    38%      16%    21%    29%

Table 1: 10-fold validation on KoKo of SMT models trained on KoKo and Falko.

5.2 Testing the Models on New Data

At an early stage of the annotation of a new learner corpus, an error-correction system could be trained on an already existing corpus. We thus tried to apply the different models trained on Falko, KoKo and the newspapers to MERLIN. However, none of the 7 models presented in the previous section achieved more than 13% accuracy and 70% precision on the whole MERLIN corpus. Despite that, these experiments highlighted an interesting aspect: all the models performed better on MERLIN texts of higher CEFR levels than on MERLIN texts of lower CEFR levels (Table 2). We suspect this phenomenon to be due to the fact that the level of language mastery of MERLIN texts of higher CEFR levels is closer to the level of language mastery of KoKo and Falko texts. This observation indicates that the training and test data must attest to the same level of language mastery, because mistakes made by beginner language learners tend to differ noticeably from mistakes made by advanced language learners. Therefore, using existing learner corpora as training data is a difficult task, as most of them target different types of learners with different profiles and bias towards specific kinds of errors.
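The character-based models above treat each character as a "word" for Moses, so learner texts have to be recast accordingly before training and decoding. The paper does not detail this step; the sketch below is our own illustration of the usual convention (cf. Nakov et al., 2012), with the underscore as an arbitrarily chosen placeholder for original spaces.

```python
def to_char_level(text, space="_"):
    """Recast a text span as a character 'sentence': each character
    becomes a separate token, original spaces become a reserved symbol."""
    return " ".join(space if c == " " else c for c in text)

def from_char_level(chars, space="_"):
    """Invert the transformation on the character-level SMT output."""
    return "".join(" " if t == space else t for t in chars.split(" "))

# A word-break error and its correction, as seen by the character-level system:
assert to_char_level("Sommer fest") == "S o m m e r _ f e s t"
assert from_char_level("S o m m e r f e s t") == "Sommerfest"
```

Because spaces survive as ordinary tokens, a character-level system can in principle merge or split words, which is relevant for the word-break errors discussed in Section 5.4.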
         A1     A2     B1     B2     C1
Prec.   60%    61%    77%    72%    78%
Acc.    15%     9%    12%    14%    17%

Table 2: precision and accuracy of the word 5-gram model trained on KoKo and Falko when tested on MERLIN texts of different CEFR levels.

5.3 Training and Testing on One Corpus

The results of the previous experiments incited us to train an SMT model on a small part of a corpus and test it on a bigger part of the same corpus, in order to observe how an SMT model would behave when trained on an already annotated part of a new learner corpus. We thus performed 3-fold validation experiments with a word 5-gram model, taking 1/3 of KoKo as training data and 2/3 of KoKo as test data, and obtained 30% accuracy. (We also calculated the BLEU score for this model and obtained 95%. This result shows that the BLEU score is irrelevant for the evaluation of error correction systems such as ours, which cannot introduce errors into error-free spans of text.) This result was much better than the 13% accuracy we had obtained by training SMT systems on KoKo and Falko and testing them on MERLIN. We thus decided to pursue our experiments with KoKo as both training and test data.

In order to observe the evolution of the system's performance with the growth of the corpus, we also trained it on 2/3 of KoKo and tested it on 1/3 of KoKo. Augmenting the training corpus size did not change the system's performance (Table 3, line 1). Such results tend to indicate that most of the performance can be obtained at an earlier stage of the annotation process.

5.4 Improving the Performance

After evaluating the impact of the training data on the system's performance, we switched our focus to optimising the way the SMT models are used. First of all, we tried to take into account not only the highest-ranked suggestion of Moses, which in many cases was equal to the error text (i.e. no correction was suggested), but also the lower-ranked suggestions, in order to find the highest-ranked suggestion that was different from the error text. This change considerably improved the accuracy for both corpus sizes and only slightly deteriorated the precision (Table 3, line 2).

In order to further improve the performance, we decided to combine the word-based and character-based systems. For this first experiment we chose the best-performing of the word-based systems, which is the word 5-gram model, and the second-best-performing of the character-based systems, which is the character 10-gram model. We chose the character 10-gram model for practical reasons: it is considerably less resource-consuming than the character 15-gram model. By applying both the word 5-gram and the character 10-gram models to the same data and comparing the overlap in their responses, we verified their degree of complementarity. This experiment showed that only in 18% of cases do the word-based and character-based models both suggest a correction (corresponding or not to the TH). In 39% of cases only the word-based system suggests a correction, and in 5% of cases only the character-based system suggests a correction. This means that by combining the two systems it is possible to improve the overall performance. We calculated the maximum theoretical accuracy of such a combined system and came to the conclusion that it cannot exceed 53% when trained on 1/3 of KoKo and 60% when trained on 2/3 of KoKo (Table 3, line 3). (The maximum theoretical accuracy would be achieved if it were possible to always choose the right system to consult for each precise error, word-based or character-based, and to never consult the system that gave a wrong result when the other system gave a correct one. In that case the maximum potential of both systems would be used.)

By simply giving preference to the word-based model before consulting the character-based model, we almost achieved the maximum theoretical accuracy (Table 3, line 4). However, we realised that by augmenting the training corpus size, we augmented the accuracy but slightly deteriorated the precision.

By analysing the performance of the different modules (word 5-gram highest-ranked suggestions, word 5-gram lower-ranked suggestions, character 10-gram) on different kinds of errors, we could observe that their performance differs according to the type of error. For example, the lower-ranked suggestions of the word-based model introduce a lot of mistakes in the correction of errors where one word was erroneously written as two separate words (e.g. Sommer fest instead of Sommerfest). We tried to prevent such false corrections by not consulting the lower-ranked suggestions of the word-based model for errors containing spaces. By introducing this rule we succeeded in improving the precision at the cost of losing some accuracy (Table 3, line 5). This experiment showed that ad-hoc rules might not be a workable solution and that a more sophisticated approach should be considered if we intend to dynamically combine several systems. In order to obtain better results by combining two or more word-based and character-based systems, further experiments should be conducted.
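The resulting decision procedure (prefer the word-based model, fall back to the character-based one, and do not consult the lower-ranked word-based suggestions for errors containing spaces) can be sketched as follows. This is our own reconstruction for illustration: the n-best list and the character-based suggestion would come from the respective Moses systems, and the function name is hypothetical.

```python
def combine(error, word_nbest, char_best):
    """Cascade combination of the word 5-gram and character 10-gram
    systems, including the rule on spaces (cf. Table 3, line 5).

    error:      the span of text marked as erroneous by the annotator
    word_nbest: ranked suggestions of the word-based model
    char_best:  best suggestion of the character-based model, or None
    Returns a correction, or None when neither model disagrees with the error."""
    for rank, suggestion in enumerate(word_nbest):
        if suggestion != error:            # a real correction, not a copy of the error
            return suggestion
        if rank == 0 and " " in error:     # rule on spaces: skip the unreliable
            break                          # lower-ranked word-based suggestions
    if char_best is not None and char_best != error:
        return char_best                   # fall back to the character-based model
    return None                            # no correction suggested

# 'Sommer fest': the top word-based suggestion merely copies the error and the
# spaces rule blocks its lower-ranked suggestions, so the character-based
# model provides the correction.
assert combine("Sommer fest", ["Sommer fest", "Sommer Fest"], "Sommerfest") == "Sommerfest"
```

An oracle that always consulted the right system for each error would realise the maximum theoretical accuracy mentioned above; this fixed cascade is a simple approximation of it.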
                                     train. 1/3    train. 2/3
                                     valid. 2/3    valid. 1/3
1 word highest-ranked corr.          30% (88%)     30% (88%)
2 word lower-ranked corr.            48% (84%)     55% (83%)
3 max. theoretical accuracy,
  word lower-ranked + character      53% (85%)     60% (84%)
4 word lower-ranked + character      53% (84%)     59% (83%)
5 word lower-ranked + character,
  with rule on spaces                52% (88%)     57% (88%)

Table 3: accuracy and precision (in brackets) of different systems according to training corpus size (3-fold validation on KoKo).

6 Conclusion

Our preliminary experiments brought us to the conclusion that an SMT system trained on a manually annotated part of a learner corpus can be helpful in error-tagging the remaining part of the same learner corpus: it is possible to train a system that proposes the right correction for half of the orthographic errors outlined by the annotators while proposing very few wrong corrections. Such results are satisfactory enough to start integrating the system into the annotation tool we use to create learner corpora (Okinina et al., 2018).

The combination of a word-based and a character-based system gave promising results; we therefore intend to continue experimenting with multiple combinations of word-based and character-based systems. We are also considering the possibility of relying on other technologies (Bryant, 2018). As in our experiments we only wanted to observe the range of performances we could expect, we trained our models with the default configuration provided with the Moses toolkit and did not perform any tuning of the parameters. Future efforts will focus on evaluating how relevant the tuning of parameters can be for such an MT task.

The choice of training data for our experiments was dictated by the availability of high-quality resources. In future experiments we would like to enlarge the spectrum of resources considered for our experiments and work with other languages, in particular Italian and English.

Acknowledgements

We would like to thank the reviewers as well as our colleagues Verena Lyding and Alexander König for their useful feedback and comments.

References

Abel, A., Glaznieks, A.: „Ich weiß zwar nicht, was mich noch erwartet, doch ...“ – Der Einsatz von Korpora zur Analyse textspezifischer Konstruktionen des konzessiven Argumentierens bei Schreibnovizen, Corpora in specialized communication, vol. 4, Bergamo, 2013, pp. 101-132.

Abel, A., Konecny, C., Autelli, E.: Annotation and error analysis of formulaic sequences in an L2 learner corpus of Italian, Third International Learner Corpus Research Conference, Book of abstracts, 2015, pp. 12-15.

Abel, A., Glaznieks, A., Nicolas, L., Stemle, E.: An extended version of the KoKo German L1 learner corpus, Proceedings of the Third Italian Conference on Computational Linguistics CLiC-it, Naples, Italy, 2016, pp. 13-18.

Abel, A., Vettori, C., Wisniewski, K.: KOLIPSI. Gli studenti altoatesini e la seconda lingua: indagine linguistica e psicosociale, vol. 2, Eurac Research, 2017.

Boyd, A., Hana, J., Nicolas, L., Meurers, D., Wisniewski, K., Abel, A., Schöne, K., Štindlová, B., Vettori, C.: The MERLIN corpus: Learner language and the CEFR, Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC), 2014, pp. 1281-1288.

Bryant, C.: Language Model Based Grammatical Error Correction without Annotated Training Data, Proceedings of the Thirteenth Workshop on Innovative Use of NLP for Building Educational Applications, 2018, pp. 247-253.

Chollampatt, S., Ng, H.: Connecting the Dots: Towards Human-Level Grammatical Error Correction, Proceedings of the 12th Workshop on Innovative Use of NLP for Building Educational Applications, 2017, pp. 327-333.

Granger, S.: A Bird's Eye View of Learner Corpus Research. In Granger, S., Hung, J., Petch-Tyson, S. (eds.), Computer Learner Corpora, Second Language Acquisition and Foreign Language Teaching, Amsterdam & Philadelphia: Benjamins, 2002, pp. 3-33.

Nerius, D. et al.: Deutsche Orthographie. 4., neu bearbeitete Auflage. Hildesheim/Zürich/New York: Olms Verlag, 2007.
Ehsan, N., Faili, H.: Grammatical and context-sensitive error correction using a statistical machine translation framework, Software – Practice and Experience, 2013, 43, pp. 187-206.

Grundkiewicz, R., Junczys-Dowmunt, M.: The WikEd Error Corpus: A Corpus of Corrective Wikipedia Edits and Its Application to Grammatical Error Correction. In Przepiórkowski, A., Ogrodniczuk, M. (eds.), Advances in Natural Language Processing. NLP 2014. Lecture Notes in Computer Science, vol. 8686. Springer, Cham, 2014, pp. 478-490.

Hasan, M. M., Khaing, H. O.: Learner Corpus and its Application to Automatic Level Checking using Machine Learning Algorithms, Proceedings of ECTI-CON, 2008, pp. 25-28.

Ji, J., Wang, Q., Toutanova, K., Gong, Y., Truong, S., Gao, J.: A Nested Attention Neural Hybrid Model for Grammatical Error Correction, ArXiv e-prints, 2017.

Junczys-Dowmunt, M., Grundkiewicz, R.: Phrase-based machine translation is state-of-the-art for automatic grammatical error correction, Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Austin, Texas, 2016, pp. 1546-1556.

Koehn, P., Hoang, H., Birch, A., Callison-Burch, C., Federico, M., Bertoldi, N., Cowan, B., Shen, W., Moran, C., Zens, R., Dyer, C., Bojar, O., Constantin, A., Herbst, E.: Moses: Open source toolkit for statistical machine translation, Proceedings of ACL '07, Prague, Czech Republic, 2007, pp. 177-180.

Laarmann-Quante, R.: Towards a Tool for Automatic Spelling Error Analysis and Feedback Generation for Freely Written German Texts Produced by Primary School Children, Proceedings of the Seventh ISCA Workshop on Speech and Language Technology in Education, 2017, pp. 36-41.

Nakov, P., Tiedemann, J.: Combining Word-Level and Character-Level Models for Machine Translation Between Closely-Related Languages, Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (ACL), 2012, pp. 301-305.

Napoles, C., Sakaguchi, K., Tetreault, J.: JFLEG: A Fluency Corpus and Benchmark for Grammatical Error Correction, Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, vol. 2, Short Papers. Association for Computational Linguistics, Valencia, Spain, 2017, pp. 229-234.

Okinina, N., Nicolas, L., Lyding, V.: Transc&Anno: A Graphical Tool for the Transcription and On-the-Fly Annotation of Handwritten Documents, Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC), 2018, pp. 701-705.

Reznicek, M., Lüdeling, A., Krummes, C., Schwantuschke, F.: Das Falko-Handbuch. Korpusaufbau und Annotationen, Version 2.0, 2012.

Reznicek, M., Lüdeling, A., Hirschmann, H.: Competing Target Hypotheses in the Falko Corpus: A Flexible Multi-Layer Corpus Architecture, Automatic Treatment and Analysis of Learner Corpus Data, John Benjamins Publishing Company, Amsterdam/Philadelphia, 2013, pp. 101-123.

Sakaguchi, K., Post, M., Van Durme, B.: Grammatical Error Correction with Neural Reinforcement Learning, Proceedings of the Eighth International Joint Conference on Natural Language Processing, Asian Federation of Natural Language Processing, Taipei, Taiwan, 2017, pp. 366-372.

Schmaltz, A., Kim, Y., Rush, A., Shieber, S.: Adapting Sequence Models for Sentence Correction, Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, 2017, pp. 2807-2813.

Stüker, S., Fay, J., Berkling, K.: Towards Context-dependent Phonetic Spelling Error Correction in Children's Freely Composed Text for Diagnostic and Pedagogical Purposes, Interspeech, 2011.

Whitelaw, C., Hutchinson, B., Chung, G., Ellis, G.: Using the Web for Language Independent Spellchecking and Autocorrection, Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, Singapore, 2009, pp. 890-899.

Yuan, Z., Briscoe, T.: Grammatical Error Correction Using Neural Machine Translation, Proceedings of NAACL-HLT 2016, 2016, pp. 380-386.

Zanasi, L., Stopfner, M.: Rilevare, osservare, consultare. Metodi e strumenti per l'analisi del plurilinguismo nella scuola secondaria di primo grado. In Coonan, C., Bier, A., Ballarin, E., La didattica delle lingue nel nuovo millennio. Le sfide dell'internazionalizzazione, Edizioni Ca' Foscari, 2018, pp. 135-148.