=Paper=
{{Paper
|id=None
|storemode=property
|title=Improvements to Korektor: A Case Study with Native and Non-Native Czech
|pdfUrl=https://ceur-ws.org/Vol-1422/73.pdf
|volume=Vol-1422
|dblpUrl=https://dblp.org/rec/conf/itat/RamasamyRS15
}}
==Improvements to Korektor: A Case Study with Native and Non-Native Czech==
Loganathan Ramasamy¹, Alexandr Rosen², and Pavel Straňák¹

¹ Institute of Formal and Applied Linguistics, Faculty of Mathematics and Physics
² Institute of Theoretical and Computational Linguistics, Faculty of Arts
Charles University in Prague

Abstract: We present recent developments of Korektor, a statistical spell checking system. In addition to a lexicon, Korektor uses language models to find real-word errors, detectable only in context. The models and error probabilities, learned from error corpora, are also used to suggest the most likely corrections. Korektor was originally trained on a small error corpus and used language models extracted from an in-house corpus, WebColl. We present two recent improvements:

• We built new language models from freely available (shuffled) versions of the Czech National Corpus and show that these perform consistently better on texts produced both by native speakers and by non-native learners of Czech.

• We trained new error models on a manually annotated learner corpus and show that they perform better than the standard error model (in error detection) not only on the learners' texts, but also on our standard evaluation data of native Czech. For error correction, the standard error model outperformed the non-native models on 2 out of 3 test datasets.

We discuss the reasons for this not-quite-intuitive result. Based on these findings and on an analysis of errors in both native and learners' Czech, we propose directions for further improvements of Korektor.

===1 Introduction===

The idea of using the context of a misspelled word to improve the performance of a spell checker is not new [10]. Moreover, recent years have seen the advance of context-aware spell checkers such as Google Suggest, offering reasonable corrections of search queries.

Methods used in such spell checkers usually employ the noisy-channel or window-based approach [4]. The system described here also belongs to the noisy-channel class. It makes extensive use of language models based on several morphological factors, exploiting the morphological richness of the target language.
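In a noisy-channel spell checker, a correction candidate is scored by combining the probability that the candidate occurs in the given context (the language model) with the probability that the observed string would be produced if the candidate were intended (the error model). The following minimal sketch illustrates this decomposition; it is not Korektor's implementation, and the scoring functions and candidate set are hypothetical stand-ins:

```python
def noisy_channel_correct(observed, context, candidates,
                          lm_logprob, error_logprob):
    """Pick the candidate c maximizing
        log P(c | context) + log P(observed | c),
    i.e. the language-model score plus the error-model score.

    candidates    -- plausible corrections of `observed`, e.g. lexicon
                     forms within a small edit distance (hypothetical)
    lm_logprob    -- hypothetical LM scorer: log P(c | context)
    error_logprob -- hypothetical error model: log P(observed | c)
    """
    return max(candidates,
               key=lambda c: lm_logprob(c, context) + error_logprob(observed, c))
```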
Errors detected by such advanced spell checkers have a natural overlap with those of rule-based grammar checkers – grammatical errors are also manifested as unlikely n-grams. Using language models or even a complete SMT approach [8] for grammatical error correction is also becoming more common; however, all the tasks and publications on grammar correction we have seen so far expect text that is already corrected in terms of spelling. See also [15] and Table 1 in [14] for the types of errors subject to correction at the CoNLL 2013 and 2014 Shared Tasks on English as a Second Language.

We make no such optimistic assumptions. As we show in Section 2, there are many types of spelling errors both in native speakers' texts and in learner corpora, although the error distributions are slightly different.

Richter [12] presented a robust spell checking system that includes language models for improved error detection and suggestion. To improve the suggestions further, the system employs error models trained on error corpora. In this paper we present some recent improvements to Richter et al.'s work in both respects: improved language models in Section 3 and task-dependent, adapted error models in Section 4. We apply native and non-native error models to both native and non-native datasets in Section 5. We analyze a portion of the system's output in Section 6 and provide some insight into the most problematic errors that the various models make. Finally, we summarize our work and list potential further improvements of Korektor's components in Section 7.

===2 Error Distribution for Native vs Non-Native Czech===

Richter [11, p. 33] presents statistics of spelling errors in Czech, based on a small corpus of 9,500 words, which is actually a transcript of an audio recording of a novel. The transcription was done by a native speaker. Following [1], the error analysis in Table 1 is based on the classification of errors into four basic groups: substitution, insertion, deletion/omission and swap/transposition/metathesis. Although the figures may be biased due to the small size of the corpus and the fact that it was transcribed by a single person, we still find them useful for a comparison with statistics of spelling errors made by non-native speakers.

Table 1: Error types in a Czech text produced by native speakers

  Error type                       Frequency   Percentage
  Substitution                     224         40.65%
  – horizontally adjacent letters  142         25.77%
  – vertically adjacent letters    2           0.36%
  – z → s                          6           1.09%
  – s → z                          1           0.18%
  – y → i                          10          1.81%
  – i → y                          10          1.81%
  – non-adjacent vocals            13          2.36%
  – diacritic confusion            21          3.81%
  – other cases                    19          3.45%
  Insertion                        235         42.65%
  – horizontally adjacent letter   162         29.40%
  – vertically adjacent letter     13          2.36%
  – same letter as previous        14          2.54%
  – other cases                    46          8.35%
  Deletion – other cases           58          10.53%
  Swap letters                     34          6.17%
  TOTAL                            551         100.00%

In Table 2, the aggregate figures from Table 1 (in the last column, headed "Native") are compared with figures from an automatically corrected learner corpus ("SGT", or CzeSL-SGT) and a hand-corrected learner corpus ("MAN", or CzeSL-MAN). The taxonomy of errors is derived from a "formal error classification" used in those two corpora, described briefly in Section 4.1.¹ In this table we follow [3] in treating errors in diacritics as distinct classes, adding their statistics on native Brazilian Portuguese for comparison in the "PT" column.

¹ See [7] for more details about the classification and the http://utkl.ff.cuni.cz/learncorp/ site, including all information about the corpora.

Table 2: Percentages of error types in a Czech text produced by non-native speakers, compared to Portuguese and Czech native speakers

  Error type              SGT     MAN     PT      Native
  Insertion               3.76    3.52    10.45   42.65
  Omission                1.39    9.20    17.12   10.53
  Substitution            31.30   37.67   12.82   36.84
  Transposition           0.16    0.19    3.69    6.17
  Missing diacritic       50.19   40.40   37.66   –
  Addition of diacritic   12.69   8.60    1.67    –
  Wrong diacritic         0.51    0.43    0.92    3.81

The high number of errors in diacritics in non-native Czech and native Portuguese, in comparison with native Czech, can be explained by the fact that native speakers of Czech are aware of the importance of diacritics both for distinguishing meaning and for giving the text an appropriate status. The high number of errors in diacritics in learner texts is confirmed by the results shown in Table 3, counted on the training portion of the CzeSL-MAN corpus by comparing the uncorrected and corrected forms, restricted to single-edit corrections.² The distribution is shown separately for the two annotation levels of CzeSL-MAN: somewhat simplifying, L1 is the level where non-words (forms spelled incorrectly in any context) are corrected, while L2 is the level where real-word errors are corrected (words correct out of context but incorrect in the syntactic context). For more details about CzeSL-MAN see Section 4.1.

² I.e., without using the "formal error types" of [7].

Table 3: Distribution of single-edit errors in the training portion of the CzeSL-MAN corpus on Levels 1 and 2

  Error type      L1                 L2
  Substitution³   22,695   84.36%    30,527   84.15%
  – Case          1,827    8.05%     5,090    16.67%
  – Diacritics    14,426   63.56%    13,367   43.79%
  Insertion       1,274    4.74%     1,800    4.96%
  Deletion        2,862    10.64%    3,809    10.50%
  Swap            72       0.27%     143      0.39%
  Total           26,903   100.00%   36,279   100.00%

³ The two error types below Substitution are actually subtypes of the substitution error.

As an illustration of the prevalence of errors in diacritics in non-native Czech, see Table 4, showing the 12 most frequent substitution errors from L1 in Table 3. There is only one error which is not an error in a diacritic (the use of the homophone i instead of y).

Table 4: The top 12 most frequent substitution errors in the CzeSL corpus

  Substituting...   Frequency   Substituting...   Frequency
  a for á           5,255       y for ý           780
  i for í           3,427       á for a           695
  e for ě           1,284       u for ů           635
  e for é           1,169       y for i           482
  i for y           1,077       í for ý           330
  í for i           1,005       z for ž           297
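Tables 1 and 3 count errors classified into the four basic groups of [1] and are restricted to pairs that differ by a single edit. A minimal sketch of such a classifier is given below; it is illustrative only, not the exact code used to produce the tables (letters with diacritics are compared as single Unicode characters, matching how the tables count them):

```python
def single_edit_type(wrong, right):
    """Classify a (misspelled, correct) pair differing by exactly one
    edit into one of the four basic groups of [1]; returns None for
    identical or multi-edit pairs, which Tables 1 and 3 exclude."""
    if wrong == right:
        return None
    if len(wrong) == len(right):
        diffs = [i for i, (a, b) in enumerate(zip(wrong, right)) if a != b]
        if len(diffs) == 1:
            return "substitution"  # e.g. (kava, káva): a for á
        if (len(diffs) == 2 and diffs[1] == diffs[0] + 1
                and wrong[diffs[0]] == right[diffs[1]]
                and wrong[diffs[1]] == right[diffs[0]]):
            return "swap"          # two adjacent letters transposed
    elif len(wrong) == len(right) + 1:
        # the misspelling contains one extra letter
        if any(wrong[:i] + wrong[i + 1:] == right for i in range(len(wrong))):
            return "insertion"
    elif len(wrong) == len(right) - 1:
        # one letter of the correct form was omitted
        if any(right[:i] + right[i + 1:] == wrong for i in range(len(right))):
            return "deletion"
    return None
```

For example, single_edit_type("kava", "káva") returns "substitution", mirroring the most frequent error type in Table 4.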
===3 Current Improvements for Native Czech Spelling Correction===

The original language model component of Korektor [12] was trained on WebColl – a 111 million word corpus of primarily news articles from the web. This corpus has two issues: (i) the texts are not representative and (ii) the language model built from this data cannot be distributed freely due to licensing issues. To address this, we evaluate Korektor using two new language models built from two corpora available from the Czech National Corpus (CNC): (i) SYN2005 [2] and (ii) SYN2010 [9]. Each has a size of 100 million words and a balanced representation of contemporary written Czech: news, fiction, professional literature etc.

We use the error model and the test data (only the Audio data set) described in [12]. Audio contains 1,371 words with 218 spelling errors, of which 12 are real-word errors. For the CNC corpora, we build 3rd order language models using KenLM [6].
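For illustration, a 3rd order KenLM model can be built and queried along the following lines; the file names are placeholders, and Korektor itself stores its language models in its own format, so this only sketches the general workflow:

```python
import kenlm  # Python bindings for KenLM; the model itself is built with
              # KenLM's command-line tools, roughly:
              #   lmplz -o 3 < syn2005.tokenized.txt > syn2005.arpa
              #   build_binary syn2005.arpa syn2005.bin

model = kenlm.Model("syn2005.bin")  # placeholder file name

# Log10 probability of a whole tokenized sentence (with implicit <s> ... </s>)
print(model.score("dnes je hezký den", bos=True, eos=True))

# Per-token scores; unusually improbable n-grams (or OOV words) are
# the signal a context-aware spell checker exploits
for logprob, ngram_length, oov in model.full_scores("dnes je hezky den"):
    print(logprob, ngram_length, oov)
```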
The spell checker accuracy is measured in terms of standard precision and recall, calculated at two levels: (i) error detection and (ii) error correction. These evaluation measures are similar in spirit to those in [17]. For both levels, precision, recall and the F-score are calculated as

  Precision (P) = TP / (TP + FP), Recall (R) = TP / (TP + FN), F-score (F1) = 2·P·R / (P + R),

where, for error detection:

• TP – the number of words with spelling errors that the spell checker correctly detected
• FP – the number of words identified as spelling errors that are not actually spelling errors
• TN – the number of correct words that the spell checker did not flag as having spelling errors
• FN – the number of words with spelling errors that the spell checker did not flag as having spelling errors

and, for error correction:

• TP – the number of words with spelling errors for which the spell checker gave the correct suggestion
• FP – the number of words (with or without spelling errors) for which the spell checker made suggestions that were either not needed (no actual error) or incorrect (not fixing an actual error)
• TN – the number of correct words that the spell checker did not flag as having spelling errors and for which no suggestions were made
• FN – the number of words with spelling errors that the spell checker did not flag or for which it did not provide any suggestions
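In code, the three measures reduce to a few lines; a minimal sketch with illustrative counts (note that TN does not enter any of the three formulas):

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1 from the counts defined above."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Hypothetical detection counts, not taken from the paper's tables:
p, r, f1 = precision_recall_f1(tp=200, fp=11, fn=18)
print(f"P = {p:.1%}, R = {r:.1%}, F1 = {f1:.1%}")
```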
The results for error detection and error correction are shown in Tables 5 and 6, respectively. The maximum edit distance, i.e., the number of edit operations allowed per word, is set to values from 1 to 5. For error detection, the best overall performance is obtained with the SYN2005 corpus when the maximum edit distance parameter is 2; the results do not change over the edit distance range from 3 to 5. Of the two CNC corpora, SYN2005 consistently provides better results than SYN2010; differences in vocabulary are the most likely reason.

For error correction as well, the best overall performance is obtained with SYN2005, at a 94.5% F1-score. WebColl performs better in 3 out of 5 cases, but this happens only when we include the top-3 suggestions in the error correction; otherwise the SYN2005 model consistently provides better scores. We have also experimented with pruned language models and obtained similar results.

Table 5: Error detection results with respect to different language models

  Max. edit distance   LM train data   P      R      F1
  1                    WebColl         94.7   90.8   92.7
                       SYN2005         95.7   90.8   93.2
                       SYN2010         94.7   89.9   92.2
  2                    WebColl         94.1   95.4   94.8
                       SYN2005         95.0   95.9   95.4
                       SYN2010         94.1   95.0   94.5
  3                    WebColl         94.1   95.4   94.8
                       SYN2005         95.0   95.9   95.4
                       SYN2010         94.1   95.0   94.5
  4                    WebColl         94.1   95.4   94.8
                       SYN2005         95.0   95.9   95.4
                       SYN2010         94.1   95.0   94.5
  5                    WebColl         94.1   95.4   94.8
                       SYN2005         95.0   95.9   95.4
                       SYN2010         94.1   95.0   94.5

Table 6: Error correction results with respect to different language models

  Max. edit             top-1              top-2              top-3
  distance   LM         P     R     F1     P     R     F1     P     R     F1
  1          WebColl    85.2  89.9  87.5   90.9  90.5  90.7   93.3  90.7  92.0
             SYN2005    87.9  90.1  89.0   92.3  90.5  91.4   93.7  90.7  92.2
             SYN2010    86.0  89.0  87.5   91.8  89.6  90.7   92.3  89.7  91.0
  2          WebColl    84.2  94.9  89.2   91.0  95.3  93.1   93.2  95.4  94.3
             SYN2005    86.8  95.5  91.0   91.8  95.7  93.7   93.2  95.8  94.5
             SYN2010    85.0  94.4  89.5   91.4  94.8  93.1   92.3  94.9  93.5
  3          WebColl    84.2  94.9  89.2   91.0  95.3  93.1   93.2  95.4  94.3
             SYN2005    86.8  95.5  91.0   91.4  95.7  93.5   92.7  95.8  94.2
             SYN2010    85.0  94.4  89.5   90.9  94.8  92.8   91.8  94.8  93.3
  4          WebColl    84.2  94.9  89.2   91.0  95.3  93.1   93.2  95.4  94.3
             SYN2005    86.8  95.5  91.0   91.4  95.7  93.5   92.7  95.8  94.2
             SYN2010    85.0  94.4  89.5   90.9  94.8  92.8   91.8  94.8  93.3
  5          WebColl    84.2  94.9  89.2   91.0  95.3  93.1   93.2  95.4  94.3
             SYN2005    86.8  95.5  91.0   91.4  95.7  93.5   92.7  95.8  94.2
             SYN2010    85.0  94.4  89.5   90.9  94.8  92.8   91.8  94.8  93.3

===4 Work in Progress for Improving Spelling Correction of Non-Native Czech===

One of the main hurdles in obtaining a new error model is the availability of annotated error data for training. Many approaches exist for obtaining error data automatically from sources such as the web [16]. Error data obtained from the web may be good enough for handling simple typing errors, but not for the more complicated misspellings that a learner or non-native speaker of a language makes; such approaches can, however, be successfully used to build general purpose spell checkers. One resource which can be of real value to spell checking is a learner corpus. Unlike native error corpora, learner corpora of non-native or foreign speakers tend to contain more errors, ranging from orthographical and morphological errors to real-word errors. In this work, we address the question whether error models trained on texts produced by native Czech speakers can be applied to errors from non-native Czech texts and vice versa. We also provide an error analysis based on the results.

====4.1 CzeSL — a Corpus of Czech as a Second Language====

A learner corpus consists of language produced by language learners, typically learners of a second or foreign language. Deviant forms and expressions can be corrected and/or annotated by tags making the nature of the error explicit. The annotation scheme in CzeSL is based on a two-stage annotation design, consisting of three levels. The level of transcribed input (Level 0) is followed by the level of orthographical and morphological corrections (Level 1), where only forms incorrect in any context are treated. The result is a string consisting of correct Czech forms, even though the sentence may not be correct as a whole. All other types of errors are corrected at Level 2.⁴

This annotation scheme was meant to be used by human annotators. However, the size of the full corpus and the costs of its manual annotation have led us to apply automatic annotation and to look for ways of improving it.

The hand-annotated part of the corpus (CzeSL-MAN) now consists of 294 thousand word tokens in 2,225 short essays, originally hand-written and transcribed.⁵ A part of the corpus is annotated independently by two annotators: 121 thousand word tokens in 955 texts. The authors are both foreign learners of Czech and Czech learners whose first language is the Romani ethnolect of Czech.

The entire CzeSL corpus (CzeSL-PLAIN) includes about 2 million word tokens. It comprises transcripts of essays of foreign learners and of Czech students with the Romani background, and also Czech Bachelor and Master theses written by foreigners.

The part consisting of essays of foreign learners only includes about 1.1 million word tokens. It is available as the CzeSL-SGT corpus with full metadata and automatic annotation, including corrections proposed by Korektor using the original language model trained on the WebColl corpus.⁶ In the annotation, Korektor detected and corrected 13.24% of the forms as incorrect, 10.33% labeled as including a spelling error and 2.92% an error in grammar, i.e. a 'real-word' error. Both the original, uncorrected texts and their corrected version were tagged and lemmatized, and "formal error tags," based on the comparison of the uncorrected and corrected forms, were assigned. The share of 'out of lexicon' forms, as detected by the tagger, is slightly lower – 9.23%.

⁴ See [5] and [13] for more details.
⁵ For an overview of corpora built as a part of the CzeSL project and the relevant links see http://utkl.ff.cuni.cz/learncorp/.
⁶ See http://utkl.ff.cuni.cz/~rosen/public/2014-czesl-sgt-en.pdf.
====4.2 The CzeSL-MAN Error Models====

We built two error models from the CzeSL-MAN corpus – one for Level 1 (L1) errors and another for Level 2 (L2) errors. As explained in Section 4.1 above, L1 errors are mainly non-word errors, while L2 errors are real-word and grammatical errors; L2 still includes form errors that are not corrected at L1 because the faulty form happens to be spelled as a form which would be correct in a different context. Extracting errors from the XML format used for encoding the original and the corrected text at L1 is straightforward: one only needs to follow the links connecting tokens at L0 (the original tokens) and L1 (the corrected tokens) and to extract the token pairs whose links are labeled as correction links. In the error extraction process, we do not extract errors that involve joining or splitting of word tokens at either level (Korektor does not handle incorrectly split or joined words at the moment).

L2 errors include not only the errors identified between L1 and L2 but also those identified already between L0 and L1, if any. This is because L2 tokens are linked to L0 tokens through L1 tokens, rather than being linked directly. For example, consider a single token at Levels L0, L1 and L2:

  všechy (L0) –formSingCh,incorBase→ všechny (L1) –agr→ všichni (L2)

The arrow stands for a link between two levels, optionally with one or more error labels. For the L1 error extraction, the extracted pair of an incorrect token and a correct token is (všechy, všechny) with the error labels (formSingCh, incorBase); for the L2 error extraction, the extracted pair is (všechy, všichni) with the error labels (formSingCh, incorBase, agr). For the L2 errors, we project the error labels of L1 onto L2. If there is no error present or annotated between L0 and L1, then we use the error annotation between L2 and L1. The extracted incorrect token is still from L0 and the correct token from L2.

Many studies have shown that most misspellings are single-edit errors, i.e., misspelled words differ from their correct spelling by exactly one letter. This also holds for our extracted L1 and L2 errors (Table 7). We train our L1 and L2 error models on single-edit errors only, so the models are quite similar to the native Czech error model described in [11]. The error training is based on [1]: error probabilities are calculated for the four single-edit operations – substitution, insertion, deletion, and swap.

Table 7: Percentage of single- and multi-edit-distance errors in the train/test portions of the L1 and L2 errors

                CzeSL-L1          CzeSL-L2
                train    test     train    test
  single-edit   73.54    72.24    67.02    69.30
  multi-edit    26.46    27.76    32.98    30.70

Table 8: Training data for the native and non-native experiments. The errors include both single- and multi-edit errors.

  Train data   Corpus size   #Errors
  WebColl      111M          12,761
  CzeSL-L1     383K          36,584
  CzeSL-L2     370K          54,131

Table 9: Test sets for the native and non-native experiments. The errors include both single- and multi-edit errors.

  Test data   Corpus size   #Errors
  Audio       1,371         218
  CzeSL-L1    33,169        3,908
  CzeSL-L2    32,597        5,217
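In its simplest form, the training step described above amounts to counting the four edit operations over the extracted pairs. A minimal sketch follows, reusing the hypothetical single_edit_type classifier from the sketch in Section 2; Korektor's actual error model conditions the probabilities on the letters involved and their context, so this aggregate version is a simplification, and the smoothing constant is an illustrative choice:

```python
from collections import Counter

EDIT_OPS = ("substitution", "insertion", "deletion", "swap")

def edit_operation_probs(error_pairs, alpha=1.0):
    """Estimate P(operation) from extracted (incorrect, correct) pairs
    with add-alpha smoothing; multi-edit pairs are skipped, mirroring
    the single-edit-only training described above."""
    counts = Counter()
    for wrong, right in error_pairs:
        op = single_edit_type(wrong, right)  # from the earlier sketch
        if op is not None:
            counts[op] += 1
    total = sum(counts.values()) + alpha * len(EDIT_OPS)
    return {op: (counts[op] + alpha) / total for op in EDIT_OPS}

# e.g. on L1 pairs extracted from CzeSL-MAN:
# probs = edit_operation_probs([("všechy", "všechny"), ...])
```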
===5 Experiments with Native and Non-Native Error Models===

For the native error model (webcoll), we use the same model as described in [12]. For the non-native error models, we create two error models as described in Section 4.2: (i) czesl_L1 – trained on the L1 errors (the CzeSL-L1 data in Table 8) and (ii) czesl_L2 – trained on the L2 errors (the CzeSL-L2 data in Table 8). We partition the CzeSL-MAN corpus in a 9:1 proportion for training and testing.

The non-native training data include more errors than the data automatically mined from the web. The non-native error models are trained on single-edit errors only (see Table 7 for the percentage of errors used for training). For the language model, we use the best model (SYN2005) obtained in Section 3.

We perform evaluation on all kinds of errors in the test data. We set the maximum edit distance parameter to 2 for all our experiments; we arrived at this value based on our observations in various experiments. We run our native and non-native models on the test data described in Table 9, and their results are given in Table 10. Error correction results are shown for the top-3 suggestions.

In error detection, in terms of F1-score, the czesl_L2 model posts a better score than the other two models for both the native and the non-native data sets. When it comes to error correction, the native model webcoll performs better on 2 out of 3 data sets, with the czesl_L2 model being the next best performer. Note that the non-native models are not tuned to any particular phenomenon, such as capitalization or keyboard layouts, so there is still some scope for improving the non-native error models. While the webcoll and czesl_L2 models transfer in both directions, i.e., the native model performs well on the non-native data and vice versa, the czesl_L1 model works better only on the CzeSL-L1 dataset. This is understandable, since the L1 error annotation does not involve complete correction of the CzeSL-MAN test data; the czesl_L1 model is thus suited, for instance, to correcting misspellings that do not involve grammar errors.
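As noted above, the maximum edit distance bounds the search for correction candidates. One common way to implement such a bound is to expand all single-edit variants iteratively and intersect them with a lexicon; the sketch below illustrates this idea with a placeholder alphabet and lexicon and is not Korektor's actual candidate search:

```python
ALPHABET = "aábcčdďeéěfghiíjklmnňoópqrřsštťuúůvwxyýzž"  # illustrative

def edits1(word):
    """All strings exactly one single-edit operation away from `word`."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    swaps = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    subs = [l + c + r[1:] for l, r in splits if r for c in ALPHABET]
    inserts = [l + c + r for l, r in splits for c in ALPHABET]
    return set(deletes + swaps + subs + inserts)

def candidates(word, lexicon, max_edit_distance=2):
    """All forms from `lexicon` (a set of correct word forms) within
    `max_edit_distance` edits of `word`; a noisy-channel system would
    then rank these with the language and error models."""
    found = {word} & lexicon
    frontier = {word}
    for _ in range(max_edit_distance):
        frontier = {e for w in frontier for e in edits1(w)}
        found |= frontier & lexicon
    return found
```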
Table 10: Error models applied to native and non-native Czech

  Error detection
             Audio              CzeSL-L1           CzeSL-L2
  Model      P     R     F1     P     R     F1     P     R     F1
  webcoll    95.0  95.9  95.4   81.8  81.7  81.7   91.0  65.0  75.9
  czesl_L1   95.0  96.8  95.9   82.2  82.2  82.2   91.1  64.4  75.4
  czesl_L2   95.0  96.8  95.9   81.2  82.7  81.9   90.9  65.4  76.1

  Error correction
             Audio              CzeSL-L1           CzeSL-L2
  Model      P     R     F1     P     R     F1     P     R     F1
  webcoll    93.2  95.8  94.5   71.7  79.6  75.4   78.0  61.5  68.8
  czesl_L1   93.7  96.7  95.2   70.2  79.8  74.7   75.5  60.0  66.8
  czesl_L2   93.7  96.7  95.2   68.2  80.0  73.6   74.9  60.9  67.2

===6 Discussion===

We manually analyzed a part (the top 3,000 tokens) of the output of Korektor on the CzeSL-L2 test data for all three models. We broadly classify the test data as having form errors (occurring between the L0 and L1 levels), grammar (gram) errors (occurring between L1 and L2), and accumulated errors (form+gram, where errors are present at all levels – between L0 and L1, and between L1 and L2). The CzeSL-L2 test data can include any of the above types of errors. About 23% of our analyzed data include one of the above errors. More than half of the errors (around 62%) are form errors and about 27% belong to the gram class; the remaining errors are form+gram errors.

In the case of form errors, both the native model (webcoll) and the non-native models (czesl_L1 and czesl_L2) detect errors at a rate of more than 89%. Form errors may or may not be systematic, and they are easily detected by all three models. Most of the error instances in the data can be categorized under missing or added diacritics, or they occur in combination with other types of errors; for instance, přítelkyně was incorrectly written as přatelkine:

  Error label: "form:formCaron0 + formSingCh + formY0 + incorBase + incorInfl"
  Error token: přatelkine
  Gold token:  přítelkyně
  webcoll:     přatelkine

In the case of gram errors, most of the errors go undetected. Out of 193 gram errors in our analyzed data, the percentage of errors detected is: webcoll 15.5%, czesl_L1 9.3% and czesl_L2 15.0%. Most of the grammar errors involve agreement, dependency and lexical errors. The agreement errors are shown in Table 11. Except for a few pairs such as jedné → jednou (incorrect → correct), mě → mé, který → kteří and teplí → teplý, most of the error tokens involving agreement errors have not been recognized by any of the three models.⁷

Table 11: Some of the agreement errors in the analyzed portion of the CzeSL-L2 test data

  incorrect usage         correct usage        category   gloss
  bavím (SG)              bavíme (PL)          number     enjoy
  byl (SG)                byly (PL)            number     was → were
  byl (SG)                Byly (PL)            number     was → were
  Chci (1ST)              Chce (3RD)           person     want → wants
  Chtěla (FEM)            Chtěl (MASC)         gender     wanted
  dobré (FEM)             dobří (MASC.ANIM)    gender     good
  dobrý (MASC)            dobrá (FEM)          gender     good
  druhý (NOM)             druhého (GEN)        case       2nd, other
  hezké (PL)              hezký (SG)           number     nice
  je (SG)                 jsou (PL)            number     is → are
  jednou (INS)            jedné (LOC)          case       one
  jich (GEN)              je (ACC)             case       them
  jsem (SG)               jsme (PL)            number     am → are
  jsme (PL)               jsem (SG)            number     are → am
  jsou (PL)               je (SG)              number     are → is
  který (SG)              kteří (PL)           number     which
  leželi (MASC.ANIM)      ležely (FEM)         gender     lay
  malý (SG)               malé (PL)            number     small
  malých (GEN)            malé (ACC)           case       small
  mé (ACC)                mí (NOM)             number     my
  Mě (PERS.PRON)          Mé (POSS.PRON)       POS        me → my
  miluju (1ST)            miluje (3RD)         person     love → loves
  mohli (MASC.ANIM)       mohly (FEM)          gender     could
  nemocní (PL)            nemocný (SG)         number     ill
  nich (LOC)              ně (ACC)             case       them
  oslavili (MASC.ANIM)    oslavila (NEUT)      gender     celebrated
  pracovní (NOM)          pracovním (INS)      case       work-related
  pracuji (1ST)           pracuje (3RD)        person     work → works
  Studovali (MASC.ANIM)   studovaly (FEM)      gender     studied
  teple (ADV)             teplé (ADJ)          POS        warmly → warm
  teplí (PL)              teplý (SG)           number     warm
  tří (GEN)               tři (ACC)            case       three
  tuhle (FEM)             Tenhle (MASC)        gender     this
  typické (FEM)           typická (NEUT)       gender     typical
  velké (PL)              velký (SG)           number     big

⁷ The category glosses should be taken with a grain of salt: many forms can have several interpretations. E.g. oslavili (MASC.ANIM) → oslavila (NEUT) 'celebrated' could also be glossed as oslavili (PL, MASC.ANIM) → oslavila (SG, FEM).

Dependency errors (e.g. a wrongly assigned morphological case, or a missed valency requirement of a syntactic governor) such as roku (GEN) → roce (LOC) 'year', kolej (ACC) → koleji (LOC) 'dormitory', roku (SG) → roky (PL) 'year' and restauraci (LOC) → restaurace (NOM) 'restaurant' have not been recognized by any of the models. The pair mi (DAT) → mě (ACC) 'me' has been successfully recognized by all three models, with the correct suggestion listed at the top:

  Error label: "gram:dep"
  Error token: mi
  Gold token:  mě
  webcoll:     mě
  czesl_L1:    mě
  czesl_L2:    mě

As another example, the pair ve → v 'in' (vocalized → unvocalized) has been recognized by the webcoll and czesl_L2 models, but not by the czesl_L1 model. When it comes to grammar errors, webcoll and czesl_L2 perform better than czesl_L1. This was expected, because the czesl_L1 model was not trained on grammar errors.

When the error involves a combination of form and gram errors, all three models tend to perform better.
Most of the form+gram errors were recognized by all three models: webcoll (85%), czesl_L1 (86%) and czesl_L2 (89%). For instance, the error pair *zajímavy → zajímavé 'interesting', labeled at both the L1 and L2 levels, was successfully recognized by all the models, and the correct suggestions were listed at the top. There were also many errors that were successfully recognized but for which the correct suggestion did not appear in the top 3, such as *nechcí → nechtěl 'didn't want', *mym → svým 'my', *kamarad → kamaráda 'friend' and *vzdělany → vzdělaná 'educated'.

Based on the results in Table 10 and the manual error analysis in this section, we can make the following general observations:

• Non-native Czech models can be applied to native test data and obtain even better results than the native Czech model (Table 10).
• From the manual analysis of the test outputs of both native and non-native Czech models, the most problematic errors are the grammar errors due to missed agreement or government (valency requirements). Some of the grammar errors involve the most commonly occurring Czech forms, such as jsme, byl, dobrý, je and druhý.
• Both native and non-native error models perform well on spelling-only errors.
• The CzeSL-MAN error data include errors that involve joining/splitting of word forms, which we did not handle in our experiments. We also skipped word order issues in the non-native errors, which are beyond the scope of current spell checker systems.

===7 Conclusions and Future Work===

We have tried to improve both the language model and the error model component of Korektor, a statistical spell checker for Czech. The language model improvements involved the employment of more balanced corpora from the Czech National Corpus, namely SYN2005 and SYN2010; we obtained better results with SYN2005.

The error model improvements involved creating non-native error models from CzeSL-MAN, a hand-annotated Czech learner corpus, and a series of experiments with native and non-native Czech data sets. The state-of-the-art improvement for the native Czech data set comes from the non-native Czech models trained on the L1 and L2 errors from CzeSL-MAN. Surprisingly, the native Czech model performed better on non-native Czech (the L2 data) than the non-native models. We attribute this to the heterogeneity of the learner error data, since the texts come from very different authors: Czech students with a Romani background as well as learners with various proficiency levels and first languages. Another potential reason could be the untuned nature of the non-native error models, which may require further improvement.

As for future work aimed at further improvements of Korektor, we plan to explore combinations of native and non-native Czech models. We would also like to extend Korektor to cover new languages, so that more analysis results can be obtained. To improve the error models further, we would like to investigate how more complex grammar errors, such as those in agreement, and form errors, such as joining/splitting of word forms, can be modeled. Finally, we would like to further analyze the non-native Czech models, so that Korektor can be used to annotate a large Czech learner corpus such as CzeSL-SGT more reliably.
===References===

[1] Church, K., Gale, W.: Probability scoring for spelling correction. Statistics and Computing 1(7) (1991), 93–103
[2] Čermák, F., Hlaváčová, J., Hnátková, M., Jelínek, T., Kocek, J., Kopřivová, M., Křen, M., Novotná, R., Petkevič, V., Schmiedtová, V., Skoumalová, H., Spoustová, J., Šulc, M., Velíšek, Z.: SYN2005: a balanced corpus of written Czech, 2005
[3] Gimenes, P. A., Roman, N. T., Carvalho, A. M. B. R.: Spelling error patterns in Brazilian Portuguese. Computational Linguistics 41(1) (2015), 175–183
[4] Golding, A. R., Roth, D.: A window-based approach to context-sensitive spelling correction. Machine Learning 34 (1999), 107–130, doi:10.1023/A:1007545901558
[5] Hana, J., Rosen, A., Škodová, S., Štindlová, B.: Error-tagged learner corpus of Czech. In: Proceedings of the Fourth Linguistic Annotation Workshop, Uppsala, Sweden, Association for Computational Linguistics, 2010
[6] Heafield, K.: KenLM: faster and smaller language model queries. In: Proceedings of the EMNLP 2011 Sixth Workshop on Statistical Machine Translation, 187–197, Edinburgh, Scotland, United Kingdom, 2011
[7] Jelínek, T., Štindlová, B., Rosen, A., Hana, J.: Combining manual and automatic annotation of a learner corpus. In: Sojka, P., Horák, A., Kopeček, I., Pala, K. (eds.), Text, Speech and Dialogue – Proceedings of the 15th International Conference TSD 2012, number 7499 in Lecture Notes in Computer Science, 127–134, Springer, 2012
[8] Junczys-Dowmunt, M., Grundkiewicz, R.: The AMU system in the CoNLL-2014 shared task: Grammatical error correction by data-intensive and feature-rich statistical machine translation. In: Proceedings of the Eighteenth Conference on Computational Natural Language Learning: Shared Task, 25–33, Baltimore, Maryland, Association for Computational Linguistics, 2014
[9] Křen, M., Bartoň, T., Cvrček, V., Hnátková, M., Jelínek, T., Kocek, J., Novotná, R., Petkevič, V., Procházka, P., Schmiedtová, V., Skoumalová, H.: SYN2010: a balanced corpus of written Czech, 2010
[10] Mays, E., Damerau, F. J., Mercer, R. L.: Context based spelling correction. Information Processing & Management 27(5) (1991), 517–522
[11] Richter, M.: An advanced spell checker of Czech. Master's Thesis, Faculty of Mathematics and Physics, Charles University, Prague, 2010
[12] Richter, M., Straňák, P., Rosen, A.: Korektor – a system for contextual spell-checking and diacritics completion. In: Proceedings of the 24th International Conference on Computational Linguistics (Coling 2012), 1019–1027, Mumbai, India, 2012, Coling 2012 Organizing Committee
[13] Rosen, A., Hana, J., Štindlová, B., Feldman, A.: Evaluating and automating the annotation of a learner corpus. Language Resources and Evaluation – Special Issue: Resources for language learning 48(1) (2014), 65–92
[14] Rozovskaya, A., Chang, K.-W., Sammons, M., Roth, D., Habash, N.: The Illinois-Columbia system in the CoNLL-2014 shared task. In: CoNLL Shared Task, 2014
[15] Rozovskaya, A., Roth, D.: Building a state-of-the-art grammatical error correction system, 2014
[16] Whitelaw, C., Hutchinson, B., Chung, G. Y., Ellis, G.: Using the web for language independent spellchecking and autocorrection. In: Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing – Volume 2, EMNLP'09, 890–899, Stroudsburg, PA, USA, Association for Computational Linguistics, 2009
[17] Wu, S.-H., Liu, C.-L., Lee, L.-H.: Chinese spelling check evaluation at SIGHAN Bake-off 2013. In: Proceedings of the Seventh SIGHAN Workshop on Chinese Language Processing, 35–42, Nagoya, Japan, Asian Federation of Natural Language Processing, 2013