ITAT 2016 Proceedings, CEUR Workshop Proceedings Vol. 1649, pp. 80–87
http://ceur-ws.org/Vol-1649, Series ISSN 1613-0073, © 2016 A. Rosen

Building and using corpora of non-native Czech

Alexandr Rosen
Institute of Theoretical and Computational Linguistics, Faculty of Arts
Charles University in Prague

1 Introduction

Investigating language acquisition by non-native learners helps to understand important linguistic issues and to develop teaching methods better suited both to the specific target language and to the learner. These tasks can now be based on empirical evidence from learner corpora.

A learner corpus consists of language produced by language learners, typically learners of a second or foreign language (L2). Such corpora may be equipped with morphological and syntactic annotation, together with the detection, correction and categorization of non-standard linguistic phenomena.

The tasks of designing, compiling, annotating and presenting such corpora are often very much unlike those routinely applied to standard corpora. There may be no standard or obvious solutions: the approach to the tasks is often seen as an answer to a specific research goal rather than as a service to a wider community of researchers and practitioners. Our aim is to investigate some of the challenges, based on a learner corpus of Czech in comparison to several other learner corpora.

After an overview of learner corpora around the world in §2 and a brief presentation of several releases of a learner corpus of Czech in §3, we examine issues inherent to the process of compiling, annotating and using such corpora, including automatic identification of errors, the design and application of an error taxonomy, and a user-friendly search tool suited to a complex annotation (§4).

2 About learner corpora

Most of the existing learner corpora include English (L2) as produced by students whose native languages (L1) are varied. Most of the corpora are partially error-annotated, see Table 1.[1] The error annotation is usually inline, equivalent to XML tags, denoting the scope, correction and categorization of an error. A few corpora such as FALKO include multi-layered annotation in a tabular format, with the option of specifying multiple target hypotheses (corrections) and several error types for single word tokens or strings thereof at different levels of linguistic abstraction: orthography, morphology, syntax, lexicon, pragmatics, intelligibility.

[1] For a more extensive overview see Štindlová (2011a) or an actively maintained list at https://www.uclouvain.be/en-cecl-lcworld.html.

The tabular format is also used in MERLIN, one of the two currently available corpora including Czech.[2] In addition to 64.5K words of Czech at CEFR levels A1–C1, the corpus also includes German and Italian. It is tagged, lemmatized, parsed and on-line searchable, with a detailed error taxonomy and the option of two target hypotheses.

[2] Multilingual Platform for European Reference Levels: Interlanguage Exploration in Context, see http://merlin-platform.eu and Wisniewski et al. (2014); Boyd et al. (2014).

3 CzeSL – the learner corpus of Czech as a Second Language

CzeSL is a part of an umbrella project, the Acquisition Corpora of Czech (AKCES), a research programme pursued since 2005 (Šebesta, 2010). In addition to CzeSL, AKCES has a written (SKRIPT) and spoken (SCHOLA) part collected from native Czech pupils, and ROMi, a part collected from pupils with Romani background, using the Romani ethnolect of Czech as their first language (L1). In the present paper we focus on written texts produced by non-native learners of Czech. However, most of the methods and tools can be applied to other parts of the corpus.

CzeSL is focused on native speakers of three main language groups: (1) Slavic, (2) other Indo-European, (3) non-Indo-European. The hand-written texts cover all language levels, from real beginners (A1) to advanced learners (B2, C1, C2). The texts are equipped with metadata records; some of them relate to the respondent (age, gender, first language, proficiency in Czech, knowledge of other languages, duration and conditions of language acquisition), while others specify the character of the text and the circumstances of its production (availability of reference tools, type of elicitation, temporal and size restrictions etc.).

The hand-written texts were transcribed using off-the-shelf editors supporting HTML (e.g., Microsoft Word or Open Office Writer). A set of codes was used to capture variants, illegible strings and self-corrections; for details see Štindlová (2011b, p. 106ff). During the transcription step, the texts were anonymized by replacing personal names with appropriate forms of Adam and Eva. Names of smaller places (streets, villages, small towns) and other potentially sensitive data were replaced by QQQ. Unreadable characters or words were transcribed as XXX.

The transcripts were converted into an XML format. Some of them were corrected (‘emended’) and labelled by error categories using a custom-built annotation editor, supporting a two-layered annotation format with m : n links between tokens at the neighbouring tiers.[3] In a post-processing step the hand-annotated texts were tagged by tools trained on native Czech in a way similar to standard corpora, i.e. by lemmas and morphosyntactic categories, and in some (currently non-public) releases of the corpus also by syntactic functions and structure. Some error annotation tasks were also done automatically: the assignment of formal error labels and even the correction step (the latter in CzeSL-SGT, see §3.2).

[3] https://bitbucket.org/jhana/feat

There are several public releases of CzeSL, which differ in the depth and method of annotation, but also in the availability of metadata and in size. Table 2 shows the content of the available releases of CzeSL, including the volumes (in thousands of tokens) and the availability of annotation and metadata.[4]

[4] Some texts in CzeSL-man v.0 are doubly annotated. The texts annotated by an additional annotator are included in the CzeSL-man v.0, a2 part. See http://utkl.ff.cuni.cz/learncorp/ for links and more details.

3.1 Releases of CzeSL without metadata: CzeSL-plain and CzeSL-man v. 0

Since 2012, the transcripts of essays hand-written by non-native learners (1.3 mil. tokens) and pupils speaking the Romani ethnolect of Czech (0.4 mil. tokens) have been available together with some Bachelor and Master theses written in Czech by foreign students (0.7 mil. tokens) as the CzeSL-plain corpus, on-line searchable via a web-based search interface of the Czech National Corpus,[5] or as full texts under the Creative Commons license from the LINDAT repository.[6] Except for specifying the three groups above and a basic structural mark-up, this corpus does not include any metadata or annotation.

[5] https://kontext.korpus.cz
[6] http://lindat.mff.cuni.cz

CzeSL-man v. 0 includes subsets of CzeSL and ROMi, about 330 thousand tokens. It is manually error-annotated at two levels. Texts of about 208 thousand tokens are annotated independently by two annotators. Like CzeSL-plain, the whole hand-annotated part is accessible online without metadata via a purpose-built search tool (SeLaQ);[7] for more about the manual annotation and the annotation process see Hana et al. (2014).

[7] http://chomsky.ruk.cuni.cz:5125

The manual annotation scheme in CzeSL is based on a two-stage annotation design, reflecting the distinction roughly between errors in orthography and morphemics on the one hand and all other error types on the other. Tokens in the original transcript are linked with their counterparts at the two successive levels by edges, possibly labelled with the type of error – see Figure 1. A syntactic error label may be linked by a pointer to a word token, specifying an agreement, valency or referential relation.[8] The level of transcribed input (Tier 0) is followed by the level of orthographical and morphemic corrections (Tier 1), where only forms incorrect in any context are treated. Errors at Tier 1 are mainly non-word errors while those at Tier 2 are real-word and grammatical errors. However, a faulty form that happens to be spelled as a form which would be correct in a different context is still corrected at Tier 1. The result at Tier 1 is a string consisting of correct Czech forms, even though the sentence may not be correct as a whole. All other types of errors are corrected at Tier 2, representing a grammatically correct, though stylistically not necessarily optimal target hypothesis.[9] Manual annotation is complemented by morphosyntactic tags and lemmas at Tier 2, ambiguously specified tags and lemmas at Tier 1, and automatically identified formal errors.[10] Splitting, joining and reordering words, together with the pointers, may make the picture rather complex, as in the authentic sentence in Figure 1.

[8] This scheme is already a compromise between a linear annotation and an open multi-layered format, but a compromise preserving links between split, joined and re-ordered tokens, corrected in two stages simultaneously, something not obviously supported in the multi-layered tabular format mentioned above in §2.
[9] See Hana et al. (2010) and Rosen et al. (2014) for more details.
[10] See Jelínek et al. (2012) for details, including a list of formal error types. The last column of Table 3 shows examples of the formal error labels.

The three tiers are represented as parallel strings of word forms with links for corresponding forms. Tier 0 is glossed for readability; forms marked by asterisks are incorrect in any context.

Errors corrected at Tier 1 include incorrect inflection (incorInfl), word boundaries (wbdPre), and stems (incorBase). Errors in punctuation (the missing comma), capitalization (prahu) or word order (se in the that-clause at Tier 2) are tagged automatically in a post-processing step.

Tier 2 captures the rest of the errors. Some error labels are linked to a token which makes the reason for the correction explicit. This includes errors in agreement (agr), government or valency in a broad sense (dep), complex verb forms (vbx) or reflexive particles (rflx). For example, ona in the nominative case is governed by the form líbit se, and should be in the dative case: jí. The label dep has an arrow pointing to the governor líbit. There is also a simple lexical correction: Proto ‘therefore’ is changed to protože ‘because’.

However, the main issue is the pair of finite verbs bylo and vadí. The most likely intention of the author is best expressed by the conditional mood. The two non-contiguous forms are replaced by the conditional auxiliary and the content verb participle in one step using a 2:2 relation.

Another complex issue is the prepositional phrase pro mně ‘for me’. Its proper form is pro mě (homophonous with pro mně, but with ‘me’ in the accusative instead of the dative), or pro mne. The accusative case is required by the preposition pro. However, the head verb requires that this complement bears the bare dative – mi. Additionally, this form is a clitic, following the conditional auxiliary.

The correction slavnou (accusative) → slavná (nominative) is due to the correction of the case of the head noun. Such corrections receive an additional label as secondary errors.

3.2 The automatically annotated CzeSL-SGT

The ‘real’ CzeSL, i.e. the corpus consisting of essays written only by non-native learners (1.1 mil. tokens), is available with automatic annotation as CzeSL-SGT,[11] extending the “foreign” part of the CzeSL-plain corpus by texts collected in 2013. This was the first release of CzeSL including full metadata. The corpus includes 8,617 texts by 1,965 different authors with 54 different first languages. The original transcription markup is discarded in this corpus, while the final author’s version is restored. The corpus is available again either for on-line searching using the search interface of the Czech National Corpus or for download from the LINDAT data repository.[12]

[11] Czech as a Second Language with Spelling, Grammar and Tags
[12] http://hdl.handle.net/11234/1-162

Word forms are tagged by word class, morphological categories and base forms (lemmas). Some forms are corrected by Korektor, a context-sensitive spelling/grammar checker,[13] and the resulting texts are tagged again. Original and corrected forms are compared and error labels are assigned. Korektor detected and corrected 13.24% incorrect forms, 10.33% labelled as including a spelling error, and 2.92% an error in grammar, i.e. a ‘real-word’ error. Both the original, uncorrected texts and their corrected version were tagged and lemmatized, and “formal error tags,” based on the comparison of the uncorrected and corrected forms, were assigned.[14] The share of non-words detected by the tagger is slightly lower – 9.23% (the tagger uses a larger lexicon).

[13] See Richter et al. (2012). The tool is available from the LINDAT repository (https://lindat.mff.cuni.cz) under the FreeBSD license.
[14] See Jelínek et al. (2012).

Automatic correction is a crucial annotation step. The tool is concerned mainly with errors in orthography and morphemics, and handles some errors in morphosyntax, including real-word errors (i.e. errors that produce a word which seems to be correct out of context), as long as they are detectable locally, within a reasonably small window of n-grams. Corrections are limited to single words, targeting a single character or a very small number of characters by insertion, omission, substitution, transposition, addition, deletion or substitution of a diacritic. Errors that involve joining or splitting of word tokens or word-order errors of any type are not handled at the moment.

The performance of Korektor was evaluated first in Štindlová et al. (2012) with about 20% error rate on the set of non-words, and later in Ramasamy et al. (2015). In an optimal setting of the model, the best results achieved in terms of F1 score were 95.4% for error detection and 91.0% for error correction. In a manual analysis of 3000 tokens, about 23% of the tokens included either a form error at Tier 1 (62%), a grammar error at Tier 2 (27%), or an accumulated error at both tiers (11%). Form errors were detected with a success rate of 89%. For grammar errors (real-word errors) the detection rate was much lower, about 15.5%. The detection of accumulated errors was similar to form errors (89%).

After all the automatic annotation steps are finished, each token is labelled by the following attributes:

• word – original word form
• lemma – lemma of word; same as word if the form is not recognized
• tag – morphological tag of word; if the form is not recognized: X@-------------
• word1 – corrected form; same as word if determined as correct
• lemma1 – lemma of word1
• tag1 – morphological tag of word1
• gs – information on whether the error was determined as a spelling (S) or grammar (G) error; for grammar errors, word is mostly recognized
• err – error type, determined by comparing word and word1.

Table 3 shows the use of the annotation in a simple sentence (1).[15]

(1) Tén pes míluje svécho kamarada – člověka.
    that dog loves self’s friend – man
    ‘That dog loves its friend – the man.’

[15] The example comes from a CzeSL-SGT text, written by a 17-year-old student, with Russian as L1 and B2 as the proficiency level in Czech (document ID ttt_G1_434).

In addition to the attributes listed above, the search interface of the Czech National Corpus offers “dynamic” attributes, derived from some positions of tag and tag1. Dynamic attributes can be used in queries to specify values of morphological categories without regular expressions, to stipulate identity of these values in two or more forms to require grammatical concord, or to compare values of a category for word and word1. These attributes are available for the following categories of the original and the corrected form:

• k, k1 – word class (position 1 of the tag)
• s, s1 – detailed word class (position 2 of the tag)
• g, g1 – gender (position 3 of the tag)
• n, n1 – number (position 4 of the tag)
• c, c1 – case (position 5 of the tag)
• p, p1 – person (position 8 of the tag)

They are meant especially for CQL queries[16] including a “global condition”. As in standard corpora, such queries target two or more word tokens with an arbitrary but equal value of an attribute such as case to express grammatical agreement and similar morphosyntactic phenomena (2).

(2) 1:[] 2:[] & 1.c = 2.c

In a learner corpus, such queries make sense even for a single word token, e.g. for expressing identical or distinct values of the morphological case of the original form and of its corrected version (3).[17]

(3) 1:[] & 1.c != 1.c1

[16] See https://www.sketchengine.co.uk/corpus-querying/
[17] Unfortunately, queries including global conditions on dynamic attributes do not produce expected results in the present version of the Manatee search engine.

In a learner corpus, metadata about the author of the text are at least as important as all other types of annotation. For the number of texts authored by students according to their first language and the CEFR proficiency level in Czech see Table 4 below. The language group abbreviations read as follows: IE = non-Slavic Indo-European, nIE = non-Indo-European, S = Slavic.

           S      IE     nIE    unknown   Σ
A1         1783   199    622    5         2609
A1+        283    21     11     0         315
A2         1348   269    480    1         2098
A2+        403    54     113    0         570
B1         929    195    357    0         1481
B2         523    115    107    0         745
C1         82     17     24     0         123
C2         0      1      0      0         1
unknown    291    27     33     324       675
Σ          5642   898    1747   330       8617

Table 4: Number of texts by language group and proficiency level in CzeSL-SGT

3.3 CzeSL-man v. 1

CzeSL-man v. 1 is a collection of manually annotated transcripts of essays of non-native speakers of Czech, written in 2009–2013, a total of 645 texts, including 298 doubly annotated texts. The texts contain 128 thousand word tokens, including 59 thousand doubly annotated tokens; for a comparison with CzeSL-SGT see Table 5.

                     CzeSL-SGT   CzeSL-man v. 1
Texts                8,600       645
Sentences            111K        11K
Words                958K        104K
Tokens               1,148K      128K
Different authors    1,965       262
Different L1s        54          32
Proficiency levels   A1–C2       A1–C1
Women/Men            5:3         3:2
Words per text       100–200     100–200

Table 5: CzeSL-man v. 1 and CzeSL-SGT compared

Tables 6 and 7 show the number of texts for each combination of CEFR level and language group in CzeSL-man v. 1.

           S      IE     nIE    unknown   Σ
A1         49     6      4      -         59
A1+        -      -      3      -         3
A2         18     26     67     -         111
A2+        81     9      59     -         149
B1         123    26     30     -         179
B2         102    11     15     -         128
C1         10     -      2      -         12
unknown    -      -      -      4         4
Σ          383    78     180    4         645

Table 6: Number of texts by language group and proficiency level in CzeSL-man v. 1

           S      IE     nIE    Σ
A1         37     2      1      40
A1+        -      -      3      3
A2         5      23     47     75
A2+        21     6      49     76
B1         20     23     28     71
B2         7      11     12     30
C1         1      -      2      3
Σ          91     65     142    298

Table 7: Number of doubly annotated texts by language group and proficiency level in CzeSL-man v. 1

In addition to the number of tokens for the same categories, Table 8 also shows the frequency of errors of the dep type, i.e. valency errors in the broad sense, including errors in the number of complements and adjuncts or errors in their morphosyntactic expression. This rather frequent error type shows a considerable and expected decrease at higher proficiency levels.

         A1       A2       B1       B2       C1      Σ
IE       227      7,336    5,311    2,340    0       15,214
dep      13       361      118      28       0       520
%dep     5.73%    4.92%    2.22%    1.20%    -       3.42%
nIE      439      17,640   7,606    4,219    760     30,664
dep      13       715      237      116      7       1,088
%dep     2.96%    4.05%    3.12%    2.75%    0.92%   3.55%
S        6,434    16,939   27,226   22,173   4,761   77,533
dep      225      470      652      443      17      1,807
%dep     3.50%    2.77%    2.39%    2.00%    0.36%   2.33%
Σ        7,100    41,915   40,143   28,732   5,521   123,411
dep      251      1,546    1,007    587      24      3,415
%dep     3.54%    3.69%    2.51%    2.04%    0.43%   2.77%

Table 8: Number of tokens and valency errors by language group and proficiency level in CzeSL-man v. 1

CzeSL-man v. 1 is about to be released for download in the LINDAT repository and for on-line searching at https://kontext.korpus.cz. Some solutions to the problem of using a feature-rich corpus search engine, which is still not suited to the two-level annotation scheme of CzeSL-man, are presented in §4.

4 Some issues and lessons learnt

Several points can be made about some of the CzeSL releases, reflecting issues involved in the design, compilation and presentation of learner corpora.

We start with CzeSL-plain and its hand-annotated part CzeSL-man v. 0: (i) Both corpora include some ROMi texts, actually produced by native speakers of a dialect of Czech, rather than by non-native speakers of Czech. This is due to the original strategy of grouping texts by the way they are processed. This has been changed in later releases, where texts produced by non-native and native learners (the latter including speakers of the Romani ethnolect of Czech) are parts of distinct corpora. (ii) Neither CzeSL-plain nor CzeSL-man v. 0 includes the full set of metadata, which were not available in the appropriate form and content at the time the two corpora were prepared and released. In CzeSL-plain, the texts are categorized into three groups: essays written by non-native learners, essays written by speakers of the Romani ethnolect of Czech, and theses written by non-native students. In CzeSL-man v. 0 no such distinction is available. (iii) Due to the uncertainty about the optimal way of representing the complex two-level manual annotation, the SeLaQ tool cannot display the two-level annotation in a graphical format.

There is a strong demand for CzeSL-man to become available for on-line searches at the Czech National Corpus portal, even if some of the properties and information present in the corpus may get lost in the conversion to the format used by the corpus search tool, based on the single-level annotation of a string of tokens. However, the converted format might still retain enough annotation to be attractive and useful for most tasks. Instead of assigning the error-related annotation to word tokens, which makes it very difficult to annotate strings of tokens, let alone discontinuous strings, errors and corrections can be treated as structural annotation, i.e. similarly to the markup for paragraphs, sentences, phrases or text chunks. Even the splitting and joining of words and word order corrections can then be expressed.

The Manatee corpus search engine, used in the Czech National Corpus, and its (No)Sketch Engine front end actually include support for learner corpora.[18] The in-line annotation can even have embedded structures, which may be used at least for some cases of multi-layered annotation. Making CzeSL-man with most of the annotation available this way thus seems a real prospect.

[18] See https://www.sketchengine.co.uk/learner-corpus-functionality/

4.1 Corpus design and planning

The target corpus may be intended for a group of users with specific research or practical needs, or for a wide audience of language acquisition experts, researchers or practitioners. In any case the goals should be realistic in order to avoid a mission ending before the goals are achieved.

4.2 Text acquisition

Some balance or at least representative proportions of text and learner categories are necessary or at least useful. Tables 4–7 show an opposite, opportunistic approach, driven by practical constraints, often justified by the unavailability of texts of a specific category.

4.3 Transcription

To avoid the need to clean transcripts with improperly used mark-up, an editing tool including strict format controls is preferable to a free-text editor.

4.4 Annotation scheme and searching

A scheme ideally suited to the data may turn into a problem later, if the consequences for the annotation process and the use of the corpus are not foreseen. Standard concordancers may require substantial tweaking of the data, while a custom-built tool may lack features of tools that have been developed for a long time. At the same time, most users of this type of corpora definitely need a friendly interface.

5 Conclusion

We have presented several releases of a learner corpus of Czech, available for on-line queries and under the Creative Commons license as full texts.

In order to reach its goals and become useful, a learner corpus project should be conceived carefully, considering many factors. By way of an example, we have shown some pitfalls in the process of building and presenting such a corpus.

The methods and tools developed within this project are not tied to this specific use and we hope they will be found useful in other projects.

Acknowledgements

The corpus could never have been built without many other members of the CzeSL team. For the work reported here the author is grateful especially to Barbora Štindlová, Jirka Hana and Tomáš Jelínek. The author’s thanks are also due to two anonymous reviewers who helped to improve the paper, and to the Grant Agency of the Czech Republic, which currently provides financial support for Non-native Czech from the Theoretical and Computational Perspective (project ID 16-10185S).

References

Boyd, A., Hana, J., Nicolas, L., Meurers, D., Wisniewski, K., Abel, A., Schöne, K., Štindlová, B., and Vettori, C. (2014). The MERLIN corpus: Learner language and the CEFR. In Calzolari, N., Choukri, K., Declerck, T., Loftsson, H., Maegaard, B., Mariani, J., Moreno, A., Odijk, J., and Piperidis, S., editors, Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC’14), Reykjavik, Iceland. European Language Resources Association (ELRA).

Hana, J., Rosen, A., Škodová, S., and Štindlová, B. (2010). Error-tagged learner corpus of Czech. In Proceedings of the Fourth Linguistic Annotation Workshop, Uppsala, Sweden. Association for Computational Linguistics.

Hana, J., Rosen, A., Štindlová, B., and Štěpánek, J. (2014). Building a learner corpus. Language Resources and Evaluation, 48(4):741–752.

Jelínek, T., Štindlová, B., Rosen, A., and Hana, J. (2012). Combining manual and automatic annotation of a learner corpus. In Sojka, P., Horák, A., Kopeček, I., and Pala, K., editors, Text, Speech and Dialogue – Proceedings of the 15th International Conference TSD 2012, number 7499 in Lecture Notes in Computer Science, pages 127–134. Springer.

Ramasamy, L., Rosen, A., and Straňák, P. (2015). Improvements to Korektor: A case study with native and non-native Czech. In Yaghob, J., editor, ITAT 2015: Information technologies – Applications and Theory / SloNLP 2015, pages 73–80, Prague. Charles University in Prague.

Richter, M., Straňák, P., and Rosen, A. (2012). Korektor – a system for contextual spell-checking and diacritics completion. In Proceedings of COLING 2012: Posters, pages 1019–1028, Mumbai, India. The COLING 2012 Organizing Committee.

Rosen, A., Hana, J., Štindlová, B., and Feldman, A. (2014). Evaluating and automating the annotation of a learner corpus. Language Resources and Evaluation – Special Issue: Resources for language learning, 48(1):65–92.

Wisniewski, K., Woldt, C., Schöne, K., Abel, A., Blaschitz, V., Štindlová, B., and Vodičková, K. (2014). The MERLIN annotation scheme for the annotation of German, Italian, and Czech learner language. Technical report. Available online http://merlin-platform.eu/.

Šebesta, K. (2010). Korpusy češtiny a osvojování jazyka [Corpora of Czech and language acquisition]. Studie z aplikované lingvistiky/Studies in Applied Linguistics, 1:11–34.

Štindlová, B. (2011a). Evaluace chybové anotace navržené pro žákovský korpus češtiny. SALi, 2(2):37–60.

Štindlová, B. (2011b). Evaluace chybové anotace v žákovském korpusu češtiny [Evaluation of Error Mark-Up in a Learner Corpus of Czech]. PhD thesis, Charles University, Faculty of Arts, Prague.

Štindlová, B., Rosen, A., Hana, J., and Škodová, S. (2012). CzeSL – an error tagged corpus of Czech as a second language. In Pęzik, P., editor, Corpus Data across Languages and Disciplines, volume 28 of Łódź Studies in Language, pages 21–32, Frankfurt am Main. Peter Lang.

Corpus       Size (MW)    L1        L2   Level      Medium    Annotation
ICLE         3            26        en   advanced   written   part
CLC          35           130       en   all        written   part
LINDSEI      0.8          11        en   advanced   spoken    part
PELCRA       0.5          pl        en   all        written   part
USE          1.2          sv        en   advanced   written   no
HKUST        25           zh        en   advanced   written   part
CHUNGDAHM    131          ko        en   all        written   part
JEFLL        0.7          jp        en   beginners  written   part
MELD         1            16        en   advanced   written   no
MICASE       1.8          various   en   advanced   spoken    no
NICT JLE     2            jp        en   all        spoken    part
RusLTC       1.5          ru        en   advanced   written   no
FALKO        0.3          5         de   advanced   written   part
FRIDA        0.2          various   fr   med-adv    spoken    part
FLLOC        2            en        fr   all        spoken    no
PiKUST       0.04         18        sl   advanced   written   yes
ASU          0.5          various   no   advanced   written   no
TUFS         0.6 Mchars   various   jp   all        written   no

Table 1: A list of learner corpora around the world

                     Non-native          Ethnolect   TOTAL   Annotation   Metadata
                     Essays    Theses
CzeSL-plain          1315      732       428         2475    no           no
CzeSL-SGT            1147      -         -           1147    auto         yes
CzeSL-man v.0, a1    134       -         192         326     manual       no
CzeSL-man v.0, a2    59        -         149         208     manual       no
CzeSL-man v.1        134       -         -           134     manual       yes

Table 2: Available releases of CzeSL

Tier 0:  Bojal jsme že ona se ne bude libila slavnou prahu , proto to bylo velmí vadí pro mně .
Gloss:   *feared aux that she rflx not will *like famous Prague , therefore it was *very resent for me .
Error labels on links between Tier 0 and Tier 1: incorBase, incorInfl, wbdPre, incorBase
Tier 1:  Bál jsme že ona se nebude líbila slavnou Prahu , proto to bylo velmi vadí pro mně .
Error labels on links between Tier 1 and Tier 2: lex, vbx, dep, agr, rflx, dep, vbx, agr,sec, dep
Tier 2:  Bál jsem se , že se jí nebude líbit slavná Praha , protože to by mi velmi vadilo .
Translation: I was afraid that she would not like the famous city of Prague, because I would be very unhappy about it.

Figure 1: Two-level manual annotation of a sentence in CzeSL, the English glosses are added

word       lemma      tag               word1      lemma1    tag1              gs   err
Tén        Tén        X@-------------   Ten        ten       PDYS1----------   S    Quant1
pes        pes        NNMS1-----A----   pes        pes       NNMS1-----A----
míluje     míluje     X@-------------   miluje     milovat   VB-S---3P-AA---   S    Quant1
svécho     svécho     X@-------------   svého      svůj      P8MS4----------   S    Voiced
kamarada   kamarada   X@-------------   kamaráda   kamarád   NNMS4-----A----   S    Quant0
-          -          Z:-------------   -          -         Z:-------------
člověka    člověk     NNMS2-----A----   člověka    člověk    NNMS4-----A----
.          .          Z:-------------   .          .         Z:-------------

Table 3: Annotation of a sample sentence in CzeSL-SGT
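
The way the "dynamic" attributes of §3.2 relate to the positional tags in Table 3 can be illustrated off-line with a short Python sketch. It is only an approximation under stated assumptions, not the implementation used by the Czech National Corpus or Manatee: the names `dynamic_attributes`, `case_changed` and `skip_unrecognized` are hypothetical, the rows are copied from Table 3, and the filter for unrecognized forms is a design choice of this sketch rather than a documented feature of the search engine. The sketch derives k, s, g, n, c and p from tag positions 1, 2, 3, 4, 5 and 8 and then emulates the condition of query (3), i.e. a differing case value for word and word1.

```python
# Tag positions of the dynamic attributes as listed in section 3.2
# (1-based in the paper, hence the "- 1" when indexing Python strings).
DYNAMIC_POSITIONS = {"k": 1, "s": 2, "g": 3, "n": 4, "c": 5, "p": 8}

# Rows of Table 3: (word, lemma, tag, word1, lemma1, tag1, gs, err).
ROWS = [
    ("Tén", "Tén", "X@-------------", "Ten", "ten", "PDYS1----------", "S", "Quant1"),
    ("pes", "pes", "NNMS1-----A----", "pes", "pes", "NNMS1-----A----", "", ""),
    ("míluje", "míluje", "X@-------------", "miluje", "milovat", "VB-S---3P-AA---", "S", "Quant1"),
    ("svécho", "svécho", "X@-------------", "svého", "svůj", "P8MS4----------", "S", "Voiced"),
    ("kamarada", "kamarada", "X@-------------", "kamaráda", "kamarád", "NNMS4-----A----", "S", "Quant0"),
    ("-", "-", "Z:-------------", "-", "-", "Z:-------------", "", ""),
    ("člověka", "člověk", "NNMS2-----A----", "člověka", "člověk", "NNMS4-----A----", "", ""),
    (".", ".", "Z:-------------", ".", ".", "Z:-------------", "", ""),
]


def dynamic_attributes(tag, suffix=""):
    """Map a positional tag to the dynamic attributes (k, s, g, n, c, p)."""
    return {name + suffix: tag[pos - 1] for name, pos in DYNAMIC_POSITIONS.items()}


def case_changed(rows, skip_unrecognized=True):
    """Emulate query (3): tokens whose case differs between tag and tag1.

    skip_unrecognized is an assumption of this sketch: forms tagged X@ have
    no case value ('-'), so a literal comparison would also report them.
    """
    hits = []
    for word, _lemma, tag, word1, _lemma1, tag1, _gs, _err in rows:
        if skip_unrecognized and tag.startswith("X@"):
            continue
        attrs = {**dynamic_attributes(tag), **dynamic_attributes(tag1, "1")}
        if attrs["c"] != attrs["c1"]:
            hits.append((word, word1, attrs["c"], attrs["c1"]))
    return hits


if __name__ == "__main__":
    for hit in case_changed(ROWS):
        print(hit)
```

With the filter switched on, the only token reported for sentence (1) is člověka, whose case value changes from 2 to 4 between tag and tag1. With `skip_unrecognized=False`, the unrecognized forms Tén, svécho and kamarada are reported as well, because their original tag carries no case value; this may be worth keeping in mind when interpreting results of queries that compare attributes of word and word1.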