=Paper=
{{Paper
|id=None
|storemode=property
|title=Improvements to Korektor: A Case Study with Native and Non-Native Czech
|pdfUrl=https://ceur-ws.org/Vol-1422/73.pdf
|volume=Vol-1422
|dblpUrl=https://dblp.org/rec/conf/itat/RamasamyRS15
}}
==Improvements to Korektor: A Case Study with Native and Non-Native Czech==
Loganathan Ramasamy¹, Alexandr Rosen², and Pavel Straňák¹

¹ Institute of Formal and Applied Linguistics, Faculty of Mathematics and Physics
² Institute of Theoretical and Computational Linguistics, Faculty of Arts
Charles University in Prague

Abstract: We present recent developments of Korektor, a statistical spell checking system. In addition to a lexicon, Korektor uses language models to find real-word errors, detectable only in context. The models and error probabilities, learned from error corpora, are also used to suggest the most likely corrections. Korektor was originally trained on a small error corpus and used language models extracted from an in-house corpus, WebColl. We present two recent improvements:

• We built new language models from freely available (shuffled) versions of the Czech National Corpus and show that these perform consistently better on texts produced both by native speakers and by non-native learners of Czech.

• We trained new error models on a manually annotated learner corpus and show that they perform better than the standard error model (in error detection) not only on the learners' texts, but also on our standard evaluation data of native Czech. For error correction, the standard error model outperformed the non-native models on 2 out of 3 test datasets.

We discuss the reasons for this not-quite-intuitive result. Based on these findings and on an analysis of errors in both native and learners' Czech, we propose directions for further improvements of Korektor.

===1 Introduction===

The idea of using the context of a misspelled word to improve the performance of a spell checker is not new [10]. Moreover, recent years have seen the advance of context-aware spell checkers such as Google Suggest, offering reasonable corrections of search queries.

Methods used in such spell checkers usually employ the noisy-channel or window-based approach [4]. The system described here also belongs to the noisy-channel class. It makes extensive use of language models based on several morphological factors, exploiting the morphological richness of the target language.
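In a noisy-channel spell checker, a correction candidate is scored by combining the probability that the candidate occurs in the given context (the language model) with the probability that the observed string would be produced if the candidate were intended (the error model). The following minimal sketch illustrates this decomposition; it is not Korektor's implementation, and the scoring functions and candidate set are hypothetical stand-ins:

```python
def noisy_channel_correct(observed, context, candidates,
                          lm_logprob, error_logprob):
    """Pick the candidate c maximizing
        log P(c | context) + log P(observed | c),
    i.e. the language-model score plus the error-model score.

    candidates    -- plausible corrections of `observed`, e.g. lexicon
                     forms within a small edit distance (hypothetical)
    lm_logprob    -- hypothetical LM scorer: log P(c | context)
    error_logprob -- hypothetical error model: log P(observed | c)
    """
    return max(candidates,
               key=lambda c: lm_logprob(c, context) + error_logprob(observed, c))
```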
Errors detected by such advanced spell checkers have a natural overlap with those of rule-based grammar checkers – grammatical errors are also manifested as unlikely n-grams. Using language models or even a complete SMT approach [8] for grammatical error correction is also becoming more common; however, all the tasks and publications on grammar correction we have seen so far expect text that is already corrected in terms of spelling. See also [15] and Table 1 in [14] for the types of errors subject to correction at the CoNLL 2013 and 2014 Shared Tasks on English as a Second Language.

We make no such optimistic assumptions. As we show in Section 2, there are many types of spelling errors both in native speakers' texts and in learner corpora, although the error distributions are slightly different.

Richter [12] presented a robust spell checking system that includes language models for improved error detection and suggestion. To improve the suggestions further, the system employs error models trained on error corpora. In this paper we present some recent improvements to Richter et al.'s work in both respects: improved language models in Section 3 and task-dependent, adapted error models in Section 4. We apply native and non-native error models to both native and non-native datasets in Section 5. We analyze a portion of the system's output in Section 6 and provide some insight into the most problematic errors that the various models make. Finally, we summarize our work and list potential further improvements of Korektor's components in Section 7.

===2 Error Distribution for Native vs Non-Native Czech===

Richter [11, p. 33] presents statistics of spelling errors in Czech, based on a small corpus of 9,500 words, which is actually a transcript of an audio recording of a novel. The transcription was done by a native speaker. Following [1], the error analysis in Table 1 is based on the classification of errors into four basic groups: substitution, insertion, deletion/omission and swap/transposition/metathesis. Although the figures may be biased due to the small size of the corpus and the fact that it was transcribed by a single person, we still find them useful for a comparison with statistics of spelling errors made by non-native speakers.

Table 1: Error types in a Czech text produced by native speakers

  Error type                       Frequency   Percentage
  Substitution                     224         40.65%
  – horizontally adjacent letters  142         25.77%
  – vertically adjacent letters    2           0.36%
  – z → s                          6           1.09%
  – s → z                          1           0.18%
  – y → i                          10          1.81%
  – i → y                          10          1.81%
  – non-adjacent vocals            13          2.36%
  – diacritic confusion            21          3.81%
  – other cases                    19          3.45%
  Insertion                        235         42.65%
  – horizontally adjacent letter   162         29.40%
  – vertically adjacent letter     13          2.36%
  – same letter as previous        14          2.54%
  – other cases                    46          8.35%
  Deletion – other cases           58          10.53%
  Swap letters                     34          6.17%
  TOTAL                            551         100.00%

In Table 2, the aggregate figures from Table 1 (in the last column, headed "Native") are compared with figures from an automatically corrected learner corpus ("SGT", or CzeSL-SGT) and a hand-corrected learner corpus ("MAN", or CzeSL-MAN). The taxonomy of errors is derived from a "formal error classification" used in those two corpora, described briefly in Section 4.1.¹ In this table we follow [3] in treating errors in diacritics as distinct classes, adding their statistics on native Brazilian Portuguese for comparison in the "PT" column.

¹ See [7] for more details about the classification and the http://utkl.ff.cuni.cz/learncorp/ site, including all information about the corpora.

Table 2: Percentages of error types in a Czech text produced by non-native speakers, compared to Portuguese and Czech native speakers

  Error type              SGT     MAN     PT      Native
  Insertion               3.76    3.52    10.45   42.65
  Omission                1.39    9.20    17.12   10.53
  Substitution            31.30   37.67   12.82   36.84
  Transposition           0.16    0.19    3.69    6.17
  Missing diacritic       50.19   40.40   37.66   –
  Addition of diacritic   12.69   8.60    1.67    –
  Wrong diacritic         0.51    0.43    0.92    3.81

The high number of errors in diacritics in non-native Czech and native Portuguese, in comparison with native Czech, can be explained by the fact that native speakers of Czech are aware of the importance of diacritics both for distinguishing meaning and for giving the text an appropriate status. The high number of errors in diacritics in learner texts is confirmed by the results shown in Table 3, counted on the training portion of the CzeSL-MAN corpus by comparing the uncorrected and corrected forms, restricted to single-edit corrections.² The distribution is shown separately for the two annotation levels of CzeSL-MAN: somewhat simplifying, L1 is the level where non-words (forms spelled incorrectly in any context) are corrected, while L2 is the level where real-word errors are corrected (words correct out of context but incorrect in the syntactic context). For more details about CzeSL-MAN see Section 4.1.

² I.e., without using the "formal error types" of [7].

Table 3: Distribution of single-edit errors in the training portion of the CzeSL-MAN corpus on Levels 1 and 2

  Error type      L1                 L2
  Substitution³   22,695   84.36%    30,527   84.15%
  – Case          1,827    8.05%     5,090    16.67%
  – Diacritics    14,426   63.56%    13,367   43.79%
  Insertion       1,274    4.74%     1,800    4.96%
  Deletion        2,862    10.64%    3,809    10.50%
  Swap            72       0.27%     143      0.39%
  Total           26,903   100.00%   36,279   100.00%

³ The two error types below Substitution are actually subtypes of the substitution error.

As an illustration of the prevalence of errors in diacritics in non-native Czech, see Table 4, showing the 12 most frequent substitution errors from L1 in Table 3. There is only one error which is not an error in a diacritic (the use of the homophone i instead of y).

Table 4: The top 12 most frequent substitution errors in the CzeSL corpus

  Substituting...   Frequency   Substituting...   Frequency
  a for á           5,255       y for ý           780
  i for í           3,427       á for a           695
  e for ě           1,284       u for ů           635
  e for é           1,169       y for i           482
  i for y           1,077       í for ý           330
  í for i           1,005       z for ž           297
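Tables 1 and 3 count errors classified into the four basic groups of [1] and are restricted to pairs that differ by a single edit. A minimal sketch of such a classifier is given below; it is illustrative only, not the exact code used to produce the tables (letters with diacritics are compared as single Unicode characters, matching how the tables count them):

```python
def single_edit_type(wrong, right):
    """Classify a (misspelled, correct) pair differing by exactly one
    edit into one of the four basic groups of [1]; returns None for
    identical or multi-edit pairs, which Tables 1 and 3 exclude."""
    if wrong == right:
        return None
    if len(wrong) == len(right):
        diffs = [i for i, (a, b) in enumerate(zip(wrong, right)) if a != b]
        if len(diffs) == 1:
            return "substitution"  # e.g. (kava, káva): a for á
        if (len(diffs) == 2 and diffs[1] == diffs[0] + 1
                and wrong[diffs[0]] == right[diffs[1]]
                and wrong[diffs[1]] == right[diffs[0]]):
            return "swap"          # two adjacent letters transposed
    elif len(wrong) == len(right) + 1:
        # the misspelling contains one extra letter
        if any(wrong[:i] + wrong[i + 1:] == right for i in range(len(wrong))):
            return "insertion"
    elif len(wrong) == len(right) - 1:
        # one letter of the correct form was omitted
        if any(right[:i] + right[i + 1:] == wrong for i in range(len(right))):
            return "deletion"
    return None
```

For example, single_edit_type("kava", "káva") returns "substitution", mirroring the most frequent error type in Table 4.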
===3 Current Improvements for Native Czech Spelling Correction===

The original language model component of Korektor [12] was trained on WebColl – a 111 million word corpus of primarily news articles from the web. This corpus has two issues: (i) the texts are not representative and (ii) the language model built from this data cannot be distributed freely due to licensing issues. To address this, we evaluate Korektor using two new language models built from two corpora available from the Czech National Corpus (CNC): (i) SYN2005 [2] and (ii) SYN2010 [9]. Each has a size of 100 million words and a balanced representation of contemporary written Czech: news, fiction, professional literature etc.

We use the error model and the test data (only the Audio data set) described in [12]. Audio contains 1,371 words with 218 spelling errors, of which 12 are real-word errors. For the CNC corpora, we build 3rd order language models using KenLM [6].
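For illustration, a 3rd order KenLM model can be built and queried along the following lines; the file names are placeholders, and Korektor itself stores its language models in its own format, so this only sketches the general workflow:

```python
import kenlm  # Python bindings for KenLM; the model itself is built with
              # KenLM's command-line tools, roughly:
              #   lmplz -o 3 < syn2005.tokenized.txt > syn2005.arpa
              #   build_binary syn2005.arpa syn2005.bin

model = kenlm.Model("syn2005.bin")  # placeholder file name

# Log10 probability of a whole tokenized sentence (with implicit <s> ... </s>)
print(model.score("dnes je hezký den", bos=True, eos=True))

# Per-token scores; unusually improbable n-grams (or OOV words) are
# the signal a context-aware spell checker exploits
for logprob, ngram_length, oov in model.full_scores("dnes je hezky den"):
    print(logprob, ngram_length, oov)
```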
The spell checker accuracy is measured in terms of standard precision and recall, calculated at two levels: (i) error detection and (ii) error correction. These evaluation measures are similar in spirit to those in [17]. For both levels, precision, recall and the F-score are calculated as

  Precision (P) = TP / (TP + FP), Recall (R) = TP / (TP + FN), F-score (F1) = 2·P·R / (P + R),

where, for error detection:

• TP – the number of words with spelling errors that the spell checker correctly detected
• FP – the number of words identified as spelling errors that are not actually spelling errors
• TN – the number of correct words that the spell checker did not flag as having spelling errors
• FN – the number of words with spelling errors that the spell checker did not flag as having spelling errors

and, for error correction:

• TP – the number of words with spelling errors for which the spell checker gave the correct suggestion
• FP – the number of words (with or without spelling errors) for which the spell checker made suggestions that were either not needed (no actual error) or incorrect (not fixing an actual error)
• TN – the number of correct words that the spell checker did not flag as having spelling errors and for which no suggestions were made
• FN – the number of words with spelling errors that the spell checker did not flag or for which it did not provide any suggestions
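In code, the three measures reduce to a few lines; a minimal sketch with illustrative counts (note that TN does not enter any of the three formulas):

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1 from the counts defined above."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Hypothetical detection counts, not taken from the paper's tables:
p, r, f1 = precision_recall_f1(tp=200, fp=11, fn=18)
print(f"P = {p:.1%}, R = {r:.1%}, F1 = {f1:.1%}")
```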
The results for error detection and error correction are shown in Tables 5 and 6, respectively. The maximum edit distance, i.e., the number of edit operations allowed per word, is set to values from 1 to 5. For error detection, the best overall performance is obtained with the SYN2005 corpus when the maximum edit distance parameter is 2; the results do not change over the edit distance range from 3 to 5. Of the two CNC corpora, SYN2005 consistently provides better results than SYN2010; differences in vocabulary are the most likely reason.

For error correction as well, the best overall performance is obtained with SYN2005, at a 94.5% F1-score. WebColl performs better in 3 out of 5 cases, but this happens only when we include the top-3 suggestions in the error correction; otherwise the SYN2005 model consistently provides better scores. We have also experimented with pruned language models and obtained similar results.

Table 5: Error detection results with respect to different language models

  Max. edit distance   LM train data   P      R      F1
  1                    WebColl         94.7   90.8   92.7
                       SYN2005         95.7   90.8   93.2
                       SYN2010         94.7   89.9   92.2
  2                    WebColl         94.1   95.4   94.8
                       SYN2005         95.0   95.9   95.4
                       SYN2010         94.1   95.0   94.5
  3                    WebColl         94.1   95.4   94.8
                       SYN2005         95.0   95.9   95.4
                       SYN2010         94.1   95.0   94.5
  4                    WebColl         94.1   95.4   94.8
                       SYN2005         95.0   95.9   95.4
                       SYN2010         94.1   95.0   94.5
  5                    WebColl         94.1   95.4   94.8
                       SYN2005         95.0   95.9   95.4
                       SYN2010         94.1   95.0   94.5

Table 6: Error correction results with respect to different language models

  Max. edit             top-1              top-2              top-3
  distance   LM         P     R     F1     P     R     F1     P     R     F1
  1          WebColl    85.2  89.9  87.5   90.9  90.5  90.7   93.3  90.7  92.0
             SYN2005    87.9  90.1  89.0   92.3  90.5  91.4   93.7  90.7  92.2
             SYN2010    86.0  89.0  87.5   91.8  89.6  90.7   92.3  89.7  91.0
  2          WebColl    84.2  94.9  89.2   91.0  95.3  93.1   93.2  95.4  94.3
             SYN2005    86.8  95.5  91.0   91.8  95.7  93.7   93.2  95.8  94.5
             SYN2010    85.0  94.4  89.5   91.4  94.8  93.1   92.3  94.9  93.5
  3          WebColl    84.2  94.9  89.2   91.0  95.3  93.1   93.2  95.4  94.3
             SYN2005    86.8  95.5  91.0   91.4  95.7  93.5   92.7  95.8  94.2
             SYN2010    85.0  94.4  89.5   90.9  94.8  92.8   91.8  94.8  93.3
  4          WebColl    84.2  94.9  89.2   91.0  95.3  93.1   93.2  95.4  94.3
             SYN2005    86.8  95.5  91.0   91.4  95.7  93.5   92.7  95.8  94.2
             SYN2010    85.0  94.4  89.5   90.9  94.8  92.8   91.8  94.8  93.3
  5          WebColl    84.2  94.9  89.2   91.0  95.3  93.1   93.2  95.4  94.3
             SYN2005    86.8  95.5  91.0   91.4  95.7  93.5   92.7  95.8  94.2
             SYN2010    85.0  94.4  89.5   90.9  94.8  92.8   91.8  94.8  93.3

===4 Work in Progress for Improving Spelling Correction of Non-Native Czech===

One of the main hurdles in obtaining a new error model is the availability of annotated error data for training. Many approaches exist for obtaining error data automatically from sources such as the web [16]. Error data obtained from the web may be good enough for handling simple typing errors, but not for the more complicated misspellings that a learner or non-native speaker of a language makes; such approaches can, however, be successfully used to build general purpose spell checkers. One resource which can be of real value to spell checking is a learner corpus. Unlike native error corpora, learner corpora of non-native or foreign speakers tend to contain more errors, ranging from orthographical and morphological errors to real-word errors. In this work, we address the question whether error models trained on texts produced by native Czech speakers can be applied to errors from non-native Czech texts and vice versa. We also provide an error analysis based on the results.

====4.1 CzeSL — a Corpus of Czech as a Second Language====

A learner corpus consists of language produced by language learners, typically learners of a second or foreign language. Deviant forms and expressions can be corrected and/or annotated by tags making the nature of the error explicit. The annotation scheme in CzeSL is based on a two-stage annotation design, consisting of three levels. The level of transcribed input (Level 0) is followed by the level of orthographical and morphological corrections (Level 1), where only forms incorrect in any context are treated. The result is a string consisting of correct Czech forms, even though the sentence may not be correct as a whole. All other types of errors are corrected at Level 2.⁴

This annotation scheme was meant to be used by human annotators. However, the size of the full corpus and the costs of its manual annotation have led us to apply automatic annotation and to look for ways of improving it.

The hand-annotated part of the corpus (CzeSL-MAN) now consists of 294 thousand word tokens in 2,225 short essays, originally hand-written and transcribed.⁵ A part of the corpus is annotated independently by two annotators: 121 thousand word tokens in 955 texts. The authors are both foreign learners of Czech and Czech learners whose first language is the Romani ethnolect of Czech.

The entire CzeSL corpus (CzeSL-PLAIN) includes about 2 million word tokens. It comprises transcripts of essays of foreign learners and of Czech students with the Romani background, and also Czech Bachelor and Master theses written by foreigners.

The part consisting of essays of foreign learners only includes about 1.1 million word tokens. It is available as the CzeSL-SGT corpus with full metadata and automatic annotation, including corrections proposed by Korektor using the original language model trained on the WebColl corpus.⁶ In the annotation, Korektor detected and corrected 13.24% of the forms as incorrect, 10.33% labeled as including a spelling error and 2.92% an error in grammar, i.e. a 'real-word' error. Both the original, uncorrected texts and their corrected version were tagged and lemmatized, and "formal error tags," based on the comparison of the uncorrected and corrected forms, were assigned. The share of 'out of lexicon' forms, as detected by the tagger, is slightly lower – 9.23%.

⁴ See [5] and [13] for more details.
⁵ For an overview of corpora built as a part of the CzeSL project and the relevant links see http://utkl.ff.cuni.cz/learncorp/.
⁶ See http://utkl.ff.cuni.cz/~rosen/public/2014-czesl-sgt-en.pdf.
====4.2 The CzeSL-MAN Error Models====

We built two error models from the CzeSL-MAN corpus – one for Level 1 (L1) errors and another for Level 2 (L2) errors. As explained in Section 4.1 above, L1 errors are mainly non-word errors, while L2 errors are real-word and grammatical errors; L2 still includes form errors that are not corrected at L1 because the faulty form happens to be spelled as a form which would be correct in a different context. Extracting errors from the XML format used for encoding the original and the corrected text at L1 is straightforward: one only needs to follow the links connecting tokens at L0 (the original tokens) and L1 (the corrected tokens) and to extract the token pairs whose links are labeled as correction links. In the error extraction process, we do not extract errors that involve joining or splitting of word tokens at either level (Korektor does not handle incorrectly split or joined words at the moment).

L2 errors include not only the errors identified between L1 and L2 but also those identified already between L0 and L1, if any. This is because L2 tokens are linked to L0 tokens through L1 tokens, rather than being linked directly. For example, consider a single token at Levels L0, L1 and L2:

  všechy (L0) –formSingCh,incorBase→ všechny (L1) –agr→ všichni (L2)

The arrow stands for a link between two levels, optionally with one or more error labels. For the L1 error extraction, the extracted pair of an incorrect token and a correct token is (všechy, všechny) with the error labels (formSingCh, incorBase); for the L2 error extraction, the extracted pair is (všechy, všichni) with the error labels (formSingCh, incorBase, agr). For the L2 errors, we project the error labels of L1 onto L2. If there is no error present or annotated between L0 and L1, then we use the error annotation between L2 and L1. The extracted incorrect token is still from L0 and the correct token from L2.

Many studies have shown that most misspellings are single-edit errors, i.e., misspelled words differ from their correct spelling by exactly one letter. This also holds for our extracted L1 and L2 errors (Table 7). We train our L1 and L2 error models on single-edit errors only, so the models are quite similar to the native Czech error model described in [11]. The error training is based on [1]: error probabilities are calculated for the four single-edit operations – substitution, insertion, deletion, and swap.

Table 7: Percentage of single- and multi-edit-distance errors in the train/test portions of the L1 and L2 errors

                CzeSL-L1          CzeSL-L2
                train    test     train    test
  single-edit   73.54    72.24    67.02    69.30
  multi-edit    26.46    27.76    32.98    30.70

Table 8: Training data for the native and non-native experiments. The errors include both single- and multi-edit errors.

  Train data   Corpus size   #Errors
  WebColl      111M          12,761
  CzeSL-L1     383K          36,584
  CzeSL-L2     370K          54,131

Table 9: Test sets for the native and non-native experiments. The errors include both single- and multi-edit errors.

  Test data   Corpus size   #Errors
  Audio       1,371         218
  CzeSL-L1    33,169        3,908
  CzeSL-L2    32,597        5,217
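In its simplest form, the training step described above amounts to counting the four edit operations over the extracted pairs. A minimal sketch follows, reusing the hypothetical single_edit_type classifier from the sketch in Section 2; Korektor's actual error model conditions the probabilities on the letters involved and their context, so this aggregate version is a simplification, and the smoothing constant is an illustrative choice:

```python
from collections import Counter

EDIT_OPS = ("substitution", "insertion", "deletion", "swap")

def edit_operation_probs(error_pairs, alpha=1.0):
    """Estimate P(operation) from extracted (incorrect, correct) pairs
    with add-alpha smoothing; multi-edit pairs are skipped, mirroring
    the single-edit-only training described above."""
    counts = Counter()
    for wrong, right in error_pairs:
        op = single_edit_type(wrong, right)  # from the earlier sketch
        if op is not None:
            counts[op] += 1
    total = sum(counts.values()) + alpha * len(EDIT_OPS)
    return {op: (counts[op] + alpha) / total for op in EDIT_OPS}

# e.g. on L1 pairs extracted from CzeSL-MAN:
# probs = edit_operation_probs([("všechy", "všechny"), ...])
```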
===5 Experiments with Native and Non-Native Error Models===

For the native error model (webcoll), we use the same model as described in [12]. For the non-native error models, we create two error models as described in Section 4.2: (i) czesl_L1 – trained on the L1 errors (the CzeSL-L1 data in Table 8) and (ii) czesl_L2 – trained on the L2 errors (the CzeSL-L2 data in Table 8). We partition the CzeSL-MAN corpus in a 9:1 proportion for training and testing.

The non-native training data include more errors than the data automatically mined from the web. The non-native error models are trained on single-edit errors only (see Table 7 for the percentage of errors used for training). For the language model, we use the best model (SYN2005) obtained in Section 3.

We perform evaluation on all kinds of errors in the test data. We set the maximum edit distance parameter to 2 for all our experiments; we arrived at this value based on our observations in various experiments. We run our native and non-native models on the test data described in Table 9, and their results are given in Table 10. Error correction results are shown for the top-3 suggestions.

In error detection, in terms of F1-score, the czesl_L2 model posts a better score than the other two models for both the native and the non-native data sets. When it comes to error correction, the native model webcoll performs better on 2 out of 3 data sets, with the czesl_L2 model being the next best performer. Note that the non-native models are not tuned to any particular phenomenon, such as capitalization or keyboard layouts, so there is still some scope for improving the non-native error models. While the webcoll and czesl_L2 models transfer in both directions, i.e., the native model performs well on the non-native data and vice versa, the czesl_L1 model works better only on the CzeSL-L1 dataset. This is understandable, since the L1 error annotation does not involve complete correction of the CzeSL-MAN test data; the czesl_L1 model is thus suited, for instance, to correcting misspellings that do not involve grammar errors.
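As noted above, the maximum edit distance bounds the search for correction candidates. One common way to implement such a bound is to expand all single-edit variants iteratively and intersect them with a lexicon; the sketch below illustrates this idea with a placeholder alphabet and lexicon and is not Korektor's actual candidate search:

```python
ALPHABET = "aábcčdďeéěfghiíjklmnňoópqrřsštťuúůvwxyýzž"  # illustrative

def edits1(word):
    """All strings exactly one single-edit operation away from `word`."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    swaps = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    subs = [l + c + r[1:] for l, r in splits if r for c in ALPHABET]
    inserts = [l + c + r for l, r in splits for c in ALPHABET]
    return set(deletes + swaps + subs + inserts)

def candidates(word, lexicon, max_edit_distance=2):
    """All forms from `lexicon` (a set of correct word forms) within
    `max_edit_distance` edits of `word`; a noisy-channel system would
    then rank these with the language and error models."""
    found = {word} & lexicon
    frontier = {word}
    for _ in range(max_edit_distance):
        frontier = {e for w in frontier for e in edits1(w)}
        found |= frontier & lexicon
    return found
```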
Table 10: Error models applied to native and non-native Czech

  Error detection
             Audio              CzeSL-L1           CzeSL-L2
  Model      P     R     F1     P     R     F1     P     R     F1
  webcoll    95.0  95.9  95.4   81.8  81.7  81.7   91.0  65.0  75.9
  czesl_L1   95.0  96.8  95.9   82.2  82.2  82.2   91.1  64.4  75.4
  czesl_L2   95.0  96.8  95.9   81.2  82.7  81.9   90.9  65.4  76.1

  Error correction
             Audio              CzeSL-L1           CzeSL-L2
  Model      P     R     F1     P     R     F1     P     R     F1
  webcoll    93.2  95.8  94.5   71.7  79.6  75.4   78.0  61.5  68.8
  czesl_L1   93.7  96.7  95.2   70.2  79.8  74.7   75.5  60.0  66.8
  czesl_L2   93.7  96.7  95.2   68.2  80.0  73.6   74.9  60.9  67.2

===6 Discussion===

We manually analyzed a part (the top 3,000 tokens) of the output of Korektor on the CzeSL-L2 test data for all three models. We broadly classify the test data as having form errors (occurring between the L0 and L1 levels), grammar (gram) errors (occurring between L1 and L2), and accumulated errors (form+gram, where errors are present at all levels – between L0 and L1, and between L1 and L2). The CzeSL-L2 test data can include any of the above types of errors. About 23% of our analyzed data include one of the above errors. More than half of the errors (around 62%) are form errors and about 27% belong to the gram class; the remaining errors are form+gram errors.

In the case of form errors, both the native model (webcoll) and the non-native models (czesl_L1 and czesl_L2) detect errors at a rate of more than 89%. Form errors may or may not be systematic, and they are easily detected by all three models. Most of the error instances in the data can be categorized under missing or added diacritics, or they occur in combination with other types of errors; for instance, přítelkyně was incorrectly written as přatelkine:

  Error label: "form:formCaron0 + formSingCh + formY0 + incorBase + incorInfl"
  Error token: přatelkine
  Gold token:  přítelkyně
  webcoll:     přatelkine

In the case of gram errors, most of the errors go undetected. Out of 193 gram errors in our analyzed data, the percentage of errors detected is: webcoll 15.5%, czesl_L1 9.3% and czesl_L2 15.0%. Most of the grammar errors involve agreement, dependency and lexical errors. The agreement errors are shown in Table 11. Except for a few pairs such as jedné → jednou (incorrect → correct), mě → mé, který → kteří and teplí → teplý, most of the error tokens involving agreement errors have not been recognized by any of the three models.⁷

Table 11: Some of the agreement errors in the analyzed portion of the CzeSL-L2 test data

  incorrect usage         correct usage        category   gloss
  bavím (SG)              bavíme (PL)          number     enjoy
  byl (SG)                byly (PL)            number     was → were
  byl (SG)                Byly (PL)            number     was → were
  Chci (1ST)              Chce (3RD)           person     want → wants
  Chtěla (FEM)            Chtěl (MASC)         gender     wanted
  dobré (FEM)             dobří (MASC.ANIM)    gender     good
  dobrý (MASC)            dobrá (FEM)          gender     good
  druhý (NOM)             druhého (GEN)        case       2nd, other
  hezké (PL)              hezký (SG)           number     nice
  je (SG)                 jsou (PL)            number     is → are
  jednou (INS)            jedné (LOC)          case       one
  jich (GEN)              je (ACC)             case       them
  jsem (SG)               jsme (PL)            number     am → are
  jsme (PL)               jsem (SG)            number     are → am
  jsou (PL)               je (SG)              number     are → is
  který (SG)              kteří (PL)           number     which
  leželi (MASC.ANIM)      ležely (FEM)         gender     lay
  malý (SG)               malé (PL)            number     small
  malých (GEN)            malé (ACC)           case       small
  mé (ACC)                mí (NOM)             number     my
  Mě (PERS.PRON)          Mé (POSS.PRON)       POS        me → my
  miluju (1ST)            miluje (3RD)         person     love → loves
  mohli (MASC.ANIM)       mohly (FEM)          gender     could
  nemocní (PL)            nemocný (SG)         number     ill
  nich (LOC)              ně (ACC)             case       them
  oslavili (MASC.ANIM)    oslavila (NEUT)      gender     celebrated
  pracovní (NOM)          pracovním (INS)      case       work-related
  pracuji (1ST)           pracuje (3RD)        person     work → works
  Studovali (MASC.ANIM)   studovaly (FEM)      gender     studied
  teple (ADV)             teplé (ADJ)          POS        warmly → warm
  teplí (PL)              teplý (SG)           number     warm
  tří (GEN)               tři (ACC)            case       three
  tuhle (FEM)             Tenhle (MASC)        gender     this
  typické (FEM)           typická (NEUT)       gender     typical
  velké (PL)              velký (SG)           number     big

⁷ The category glosses should be taken with a grain of salt: many forms can have several interpretations. E.g. oslavili (MASC.ANIM) → oslavila (NEUT) 'celebrated' could also be glossed as oslavili (PL, MASC.ANIM) → oslavila (SG, FEM).

Dependency errors (e.g. a wrongly assigned morphological case, or a missed valency requirement of a syntactic governor) such as roku (GEN) → roce (LOC) 'year', kolej (ACC) → koleji (LOC) 'dormitory', roku (SG) → roky (PL) 'year' and restauraci (LOC) → restaurace (NOM) 'restaurant' have not been recognized by any of the models. The pair mi (DAT) → mě (ACC) 'me' has been successfully recognized by all three models, with the correct suggestion listed at the top:

  Error label: "gram:dep"
  Error token: mi
  Gold token:  mě
  webcoll:     mě
  czesl_L1:    mě
  czesl_L2:    mě

As another example, the pair ve → v 'in' (vocalized → unvocalized) has been recognized by the webcoll and czesl_L2 models, but not by the czesl_L1 model. When it comes to grammar errors, webcoll and czesl_L2 perform better than czesl_L1. This was expected, because the czesl_L1 model was not trained on grammar errors.

When the error involves a combination of form and gram errors, all three models tend to perform better.
Most of the form+gram errors were recognized by all three models: webcoll (85%), czesl_L1 (86%) and czesl_L2 (89%). For instance, the error pair *zajímavy → zajímavé 'interesting', labeled at both the L1 and L2 levels, was successfully recognized by all the models, and the correct suggestions were listed at the top. There were also many errors that were successfully recognized but for which the correct suggestion did not appear in the top 3, such as *nechcí → nechtěl 'didn't want', *mym → svým 'my', *kamarad → kamaráda 'friend' and *vzdělany → vzdělaná 'educated'.

Based on the results in Table 10 and the manual error analysis in this section, we can make the following general observations:

• Non-native Czech models can be applied to native test data and obtain even better results than the native Czech model (Table 10).
• From the manual analysis of the test outputs of both native and non-native Czech models, the most problematic errors are the grammar errors due to missed agreement or government (valency requirements). Some of the grammar errors involve the most commonly occurring Czech forms, such as jsme, byl, dobrý, je and druhý.
• Both native and non-native error models perform well on spelling-only errors.
• The CzeSL-MAN error data include errors that involve joining/splitting of word forms, which we did not handle in our experiments. We also skipped word order issues in the non-native errors, which are beyond the scope of current spell checker systems.

===7 Conclusions and Future Work===

We have tried to improve both the language model and the error model component of Korektor, a statistical spell checker for Czech. The language model improvements involved the employment of more balanced corpora from the Czech National Corpus, namely SYN2005 and SYN2010; we obtained better results with SYN2005.

The error model improvements involved creating non-native error models from CzeSL-MAN, a hand-annotated Czech learner corpus, and a series of experiments with native and non-native Czech data sets. The state-of-the-art improvement for the native Czech data set comes from the non-native Czech models trained on the L1 and L2 errors from CzeSL-MAN. Surprisingly, the native Czech model performed better on non-native Czech (the L2 data) than the non-native models. We attribute this to the heterogeneity of the learner error data, since the texts come from very different authors: Czech students with a Romani background as well as learners with various proficiency levels and first languages. Another potential reason could be the untuned nature of the non-native error models, which may require further improvement.

As for future work aimed at further improvements of Korektor, we plan to explore combinations of native and non-native Czech models. We would also like to extend Korektor to cover new languages, so that more analysis results can be obtained. To improve the error models further, we would like to investigate how more complex grammar errors, such as those in agreement, and form errors, such as joining/splitting of word forms, can be modeled. Finally, we would like to further analyze the non-native Czech models, so that Korektor can be used to annotate a large Czech learner corpus such as CzeSL-SGT more reliably.
===References===

[1] Church, K., Gale, W.: Probability scoring for spelling correction. Statistics and Computing 1(7) (1991), 93–103
[2] Čermák, F., Hlaváčová, J., Hnátková, M., Jelínek, T., Kocek, J., Kopřivová, M., Křen, M., Novotná, R., Petkevič, V., Schmiedtová, V., Skoumalová, H., Spoustová, J., Šulc, M., Velíšek, Z.: SYN2005: a balanced corpus of written Czech, 2005
[3] Gimenes, P. A., Roman, N. T., Carvalho, A. M. B. R.: Spelling error patterns in Brazilian Portuguese. Computational Linguistics 41(1) (2015), 175–183
[4] Golding, A. R., Roth, D.: A window-based approach to context-sensitive spelling correction. Machine Learning 34 (1999), 107–130, doi:10.1023/A:1007545901558
[5] Hana, J., Rosen, A., Škodová, S., Štindlová, B.: Error-tagged learner corpus of Czech. In: Proceedings of the Fourth Linguistic Annotation Workshop, Uppsala, Sweden, Association for Computational Linguistics, 2010
[6] Heafield, K.: KenLM: faster and smaller language model queries. In: Proceedings of the EMNLP 2011 Sixth Workshop on Statistical Machine Translation, 187–197, Edinburgh, Scotland, United Kingdom, 2011
[7] Jelínek, T., Štindlová, B., Rosen, A., Hana, J.: Combining manual and automatic annotation of a learner corpus. In: Sojka, P., Horák, A., Kopeček, I., Pala, K. (eds.), Text, Speech and Dialogue – Proceedings of the 15th International Conference TSD 2012, number 7499 in Lecture Notes in Computer Science, 127–134, Springer, 2012
[8] Junczys-Dowmunt, M., Grundkiewicz, R.: The AMU system in the CoNLL-2014 shared task: Grammatical error correction by data-intensive and feature-rich statistical machine translation. In: Proceedings of the Eighteenth Conference on Computational Natural Language Learning: Shared Task, 25–33, Baltimore, Maryland, Association for Computational Linguistics, 2014
[9] Křen, M., Bartoň, T., Cvrček, V., Hnátková, M., Jelínek, T., Kocek, J., Novotná, R., Petkevič, V., Procházka, P., Schmiedtová, V., Skoumalová, H.: SYN2010: a balanced corpus of written Czech, 2010
[10] Mays, E., Damerau, F. J., Mercer, R. L.: Context based spelling correction. Information Processing & Management 27(5) (1991), 517–522
[11] Richter, M.: An advanced spell checker of Czech. Master's Thesis, Faculty of Mathematics and Physics, Charles University, Prague, 2010
[12] Richter, M., Straňák, P., Rosen, A.: Korektor – a system for contextual spell-checking and diacritics completion. In: Proceedings of the 24th International Conference on Computational Linguistics (Coling 2012), 1019–1027, Mumbai, India, 2012, Coling 2012 Organizing Committee
[13] Rosen, A., Hana, J., Štindlová, B., Feldman, A.: Evaluating and automating the annotation of a learner corpus. Language Resources and Evaluation – Special Issue: Resources for language learning 48(1) (2014), 65–92
[14] Rozovskaya, A., Chang, K.-W., Sammons, M., Roth, D., Habash, N.: The Illinois-Columbia system in the CoNLL-2014 shared task. In: CoNLL Shared Task, 2014
[15] Rozovskaya, A., Roth, D.: Building a state-of-the-art grammatical error correction system, 2014
[16] Whitelaw, C., Hutchinson, B., Chung, G. Y., Ellis, G.: Using the web for language independent spellchecking and autocorrection. In: Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing – Volume 2, EMNLP'09, 890–899, Stroudsburg, PA, USA, Association for Computational Linguistics, 2009
[17] Wu, S.-H., Liu, C.-L., Lee, L.-H.: Chinese spelling check evaluation at SIGHAN Bake-off 2013. In: Proceedings of the Seventh SIGHAN Workshop on Chinese Language Processing, 35–42, Nagoya, Japan, Asian Federation of Natural Language Processing, 2013