=Paper=
{{Paper
|id=Vol-1649/80
|storemode=property
|title=Building and Using Corpora of Non-Native Czech
|pdfUrl=https://ceur-ws.org/Vol-1649/80.pdf
|volume=Vol-1649
|authors=Alexandr Rosen
|dblpUrl=https://dblp.org/rec/conf/itat/Rosen16
}}
==Building and Using Corpora of Non-Native Czech==
ITAT 2016 Proceedings, CEUR Workshop Proceedings Vol. 1649, pp. 80–87
http://ceur-ws.org/Vol-1649, Series ISSN 1613-0073, © 2016 A. Rosen
Building and using corpora of non-native Czech
Alexandr Rosen
Institute of Theoretical and Computational Linguistics, Faculty of Arts
Charles University in Prague
===1 Introduction===

Investigating language acquisition by non-native learners helps to understand important linguistic issues and develop teaching methods, better suited both to the specific target language and to the learner. These tasks can now be based on empirical evidence from learner corpora.

A learner corpus consists of language produced by language learners, typically learners of a second or foreign language (L2). Such corpora may be equipped with morphological and syntactic annotation, together with the detection, correction and categorization of non-standard linguistic phenomena.

The tasks of designing, compiling, annotating and presenting such corpora are often very much unlike those routinely applied to standard corpora. There may be no standard or obvious solutions: the approach to the tasks is often seen as an answer to a specific research goal rather than as a service to a wider community of researchers and practitioners. Our aim is to investigate some of the challenges, based on a learner corpus of Czech in comparison to several other learner corpora.

After an overview of learner corpora around the world in §2 and a brief presentation of several releases of a learner corpus of Czech in §3, we examine issues inherent to the process of compiling, annotating and using such corpora, including automatic identification of errors, the design and application of error taxonomy, and a user-friendly search tool, suited to a complex annotation (§4).

===2 About learner corpora===

Most of the existing learner corpora include English (L2) as produced by students whose native languages (L1) are varied. Most of the corpora are partially error-annotated, see Table 1.¹ The error annotation is usually inline, equivalent to XML tags, denoting the scope, correction and categorization of an error. A few corpora such as FALKO include multi-layered annotation in a tabular format, with the option of specifying multiple target hypotheses (corrections) and several error types for single word tokens or strings thereof at different levels of linguistic abstraction: orthography, morphology, syntax, lexicon, pragmatics, intelligibility.

¹ For a more extensive overview see Štindlová (2011a) or an actively maintained list at https://www.uclouvain.be/en-cecl-lcworld.html.

The tabular format is also used in MERLIN, one of the two currently available corpora including Czech.² In addition to 64.5K words of Czech in CEFR levels A1–C1, the corpus also includes German and Italian. It is tagged, lemmatized, parsed and on-line searchable, with a detailed error taxonomy and the option of two target hypotheses.

² Multilingual Platform for European Reference Levels: Interlanguage Exploration in Context, see http://merlin-platform.eu and Wisniewski et al. (2014) and Boyd et al. (2014).

===3 CzeSL – the learner corpus of Czech as a Second Language===

CzeSL is a part of an umbrella project, the Acquisition Corpora of Czech (AKCES), a research programme pursued since 2005 (Šebesta, 2010). In addition to CzeSL, AKCES has a written (SKRIPT) and spoken (SCHOLA) part collected from native Czech pupils, and ROMi, a part collected from pupils with Romani background, using the Romani ethnolect of Czech as their first language (L1). In the present paper we focus on written texts produced by non-native learners of Czech. However, most of the methods and tools can be applied to other parts of the corpus.

CzeSL is focused on native speakers of three main language groups: (1) Slavic, (2) other Indo-European, (3) non-Indo-European. The hand-written texts cover all language levels, from real beginners (A1) to advanced learners (B2, C1, C2). The texts are equipped with metadata records; some of them relate to the respondent (age, gender, first language, proficiency in Czech, knowledge of other languages, duration and conditions of language acquisition), while others specify the character of the text and circumstances of its production (availability of reference tools, type of elicitation, temporal and size restrictions etc.).

The hand-written texts were transcribed using off-the-shelf editors supporting HTML (e.g., Microsoft Word or Open Office Writer). A set of codes was used to capture variants, illegible strings, self-corrections; for details see (Štindlová, 2011b, p. 106ff). During the transcription step, the texts were anonymized by replacing personal names with appropriate forms of Adam and Eva. Names of smaller places (streets, villages, small towns) and other potentially sensitive data were replaced by QQQ. Unreadable characters or words were transcribed as XXX.

The transcripts were converted into an XML format. Some of them were corrected (‘emended’) and labelled
by error categories using a custom-built annotation editor, supporting a two-layered annotation format with m : n links between tokens at the neighbouring tiers.³ In a post-processing step the hand-annotated texts were tagged by tools trained on native Czech in a way similar to standard corpora, i.e. by lemmas, morphosyntactic categories, in some (currently non-public) releases of the corpus also by syntactic functions and structure. Some error annotation tasks were also done automatically: the assignment of formal error labels and even the correction step (the latter in CzeSL-SGT, see §3.2).

³ https://bitbucket.org/jhana/feat

There are several public releases of CzeSL, which differ in the depth and method of annotation, but also in the availability of metadata and size. Table 2 shows the content of available releases of CzeSL, including the volumes (in thousands of tokens), and the availability of annotation and metadata.⁴

⁴ Some texts in CzeSL-man v.0 are doubly annotated. The texts annotated by an additional annotator are included in the CzeSL-man v.0, a2 part. See http://utkl.ff.cuni.cz/learncorp/ for links and more details.

====3.1 Releases of CzeSL without metadata: CzeSL-plain and CzeSL-man v. 0====

Since 2012, the transcripts of essays hand-written by non-native learners (1.3 mil. tokens) and pupils speaking the Romani ethnolect of Czech (0.4 mil. tokens) have been available together with some Bachelor and Master theses written in Czech by foreign students (0.7 mil. tokens) as the CzeSL-plain corpus, on-line searchable via a web-based search interface of the Czech National Corpus,⁵ or as full texts under the Creative Commons license from the LINDAT repository.⁶ Except for specifying the three groups above and a basic structural mark-up, this corpus does not include any metadata or annotation.

⁵ https://kontext.korpus.cz

⁶ http://lindat.mff.cuni.cz

CzeSL-man v. 0 includes subsets of CzeSL and ROMi, about 330 thousand tokens. It is manually error-annotated at two levels. Texts of about 208 thousand tokens are annotated independently by two annotators. Like CzeSL-plain, the whole hand-annotated part is accessible online without metadata via a purpose-built search tool (SeLaQ);⁷ for more about the manual annotation and the annotation process see Hana et al. (2014).

⁷ http://chomsky.ruk.cuni.cz:5125

The manual annotation scheme in CzeSL is based on a two-stage annotation design, reflecting the distinction roughly between errors in orthography and morphemics on the one hand and all other error types on the other. Tokens in the original transcript are linked with their counterparts at the two successive levels by edges, possibly labelled with the type of error – see Figure 1. A syntactic error label may be linked by a pointer to a word token, specifying an agreement, valency or referential relation.⁸ The level of transcribed input (Tier 0) is followed by the level of orthographical and morphemic corrections (Tier 1), where only forms incorrect in any context are treated. Errors at Tier 1 are mainly non-word errors while those at Tier 2 are real-word and grammatical errors. However, a faulty form that happens to be spelled as a form which would be correct in a different context is still corrected at Tier 1. The result at Tier 1 is a string consisting of correct Czech forms, even though the sentence may not be correct as a whole. All other types of errors are corrected at Tier 2, representing a grammatically correct, though stylistically not necessarily optimal target hypothesis.⁹ Manual annotation is complemented by morphosyntactic tags and lemmas at Tier 2, ambiguously specified tags and lemmas at Tier 1, and automatically identified formal errors.¹⁰ Splitting, joining and reordering words, together with the pointers, may make the picture rather complex, as in an authentic sentence in Figure 1.

⁸ This scheme is already a compromise between a linear annotation and an open multi-layered format, but a compromise preserving links between split, joined and re-ordered tokens, corrected in two stages simultaneously, something not obviously supported in the multilayered tabular format mentioned above in §2.

⁹ See Hana et al. (2010) and Rosen et al. (2014) for more details.

¹⁰ See Jelínek et al. (2012) for details, including a list of formal error types. The last column of Table 3 shows examples of the formal error labels.

The three tiers are represented as parallel strings of word forms with links for corresponding forms. Tier 0 is glossed for readability; forms marked by asterisks are incorrect in any context.

Errors corrected at Tier 1 include incorrect inflection (incorInfl), word boundaries (wbdPre), and stems (incorBase). Errors in punctuation (the missing comma), capitalization (prahu) or word order (se in the that-clause at Tier 2) are tagged automatically in a post-processing step.

Tier 2 captures the rest of errors. Some error labels are linked to a token which makes the reason for the correction explicit. This includes errors in agreement (agr), government or valency in a broad sense (dep), complex verb forms (vbx) or reflexive particles (rflx). For example, ona in the nominative case is governed by the form líbit se, and should be in the dative case: jí. The label dep has an arrow pointing to the governor líbit. There is also a simple lexical correction: Proto ‘therefore’ is changed to protože ‘because’.

However, the main issue is the two finite verbs bylo and vadí. The most likely intention of the author is best expressed by the conditional mood. The two non-contiguous forms are replaced by the conditional auxiliary and the content verb participle in one step using a 2:2 relation. Another complex issue is the prepositional phrase pro mně ‘for me’. Its proper form is pro mě (homonymous with pro mně, but with ‘me’ in accusative instead of dative), or pro mne. The accusative case is required by the preposition pro. However, the head verb requires that this complement bears bare dative – mi. Additionally, this form is a
clitic, following the conditional auxiliary.

The correction slavnou (accusative) → slavná (nominative) is due to the correction of the case of the head noun. Such corrections receive an additional label as secondary errors.

====3.2 The automatically annotated CzeSL-SGT====

The ‘real’ CzeSL, i.e. the corpus consisting of essays written only by non-native learners (1.1 mil. tokens), is available with automatic annotation as CzeSL-SGT,¹¹ extending the “foreign” part of the CzeSL-plain corpus by texts collected in 2013. This was the first release of CzeSL including full metadata. The corpus includes 8,617 texts by 1,965 different authors with 54 different first languages. The original transcription markup is discarded in this corpus, while the final author’s version is restored. The corpus is available again either for on-line searching using the search interface of the Czech National Corpus or for download from the LINDAT data repository.¹²

¹¹ Czech as a Second Language with Spelling, Grammar and Tags

¹² http://hdl.handle.net/11234/1-162

Word forms are tagged by word class, morphological categories and base forms (lemmas). Some forms are corrected by Korektor, a context-sensitive spelling/grammar checker,¹³ and the resulting texts are tagged again. Original and corrected forms are compared and error labels are assigned. Korektor detected and corrected 13.24% of forms as incorrect: 10.33% were labelled as including a spelling error, and 2.92% as an error in grammar, i.e. a ‘real-word’ error. Both the original, uncorrected texts and their corrected version were tagged and lemmatized, and “formal error tags,” based on the comparison of the uncorrected and corrected forms, were assigned.¹⁴ The share of non-words detected by the tagger is slightly lower – 9.23% (the tagger uses a larger lexicon).

¹³ See Richter et al. (2012). The tool is available from the LINDAT repository (https://lindat.mff.cuni.cz) under the FreeBSD license.

¹⁴ See Jelínek et al. (2012).

Automatic correction is a crucial annotation step. The tool is concerned mainly with errors in orthography and morphemics, and handles some errors in morphosyntax, including real-word errors (i.e. errors that produce a word which seems to be correct out of context), as long as they are detectable locally, within a reasonably small window of n-grams. Corrections are limited to single words, targeting a single character or a very small number of characters by insertion, omission, substitution, transposition, addition, deletion or substitution of a diacritic. Errors that involve joining or splitting of word tokens or word-order errors of any type are not handled at the moment.

The performance of Korektor was evaluated first in Štindlová et al. (2012) with about 20% error rate on the set of non-words, and later in Ramasamy et al. (2015). In an optimal setting of the model, the best results achieved in terms of F1 score were 95.4% for error detection and 91.0% for error correction. In a manual analysis of 3000 tokens, about 23% of the tokens included either a form error at Tier 1 (62%), a grammar error at Tier 2 (27%), or an accumulated error at both tiers (11%). Form errors were detected with a success rate of 89%. For grammar errors (real-word errors) the detection rate was much lower, about 15.5%. The detection of accumulated errors was similar to form errors (89%).

After all the automatic annotation steps are finished, each token is labelled by the following attributes:

• word – original word form

• lemma – lemma of word; same as word if the form is not recognized

• tag – morphological tag of word; if the form is not recognized: X@-------------

• word1 – corrected form; same as word if determined as correct

• lemma1 – lemma of word1

• tag1 – morphological tag of word1

• gs – information on whether the error was determined as a spelling (S) or grammar (G) error; for grammar errors, word is mostly recognized

• err – error type, determined by comparing word and word1.

Table 3 shows the use of the annotation in a simple sentence (1).¹⁵

(1) Tén pes míluje svécho kamarada – člověka.

that dog loves self’s friend – man

‘That dog loves its friend – the man.’

¹⁵ The example comes from a CzeSL-SGT text, written by a 17-year-old student, with Russian as L1 and B2 as the proficiency level in Czech (document ID ttt_G1_434).

In addition to the attributes listed above, the search interface of the Czech National Corpus offers “dynamic” attributes, derived from some positions of tag and tag1. Dynamic attributes can be used in queries to specify values of morphological categories without regular expressions, to stipulate identity of these values in two or more forms to require grammatical concord, or to compare values of a category for word and word1. These attributes are available for the following categories of the original and the corrected form:

• k, k1 – word class (position 1 of the tag)

• s, s1 – detailed word class (position 2 of the tag)

• g, g1 – gender (position 3 of the tag)

• n, n1 – number (position 4 of the tag)

• c, c1 – case (position 5 of the tag)
• p, p1 – person (position 8 of the tag)

They are meant especially for CQL queries¹⁶ including a “global condition”. As in standard corpora, such queries target two or more word tokens with an arbitrary but equal value of an attribute such as case to express grammatical agreement and similar morphosyntactic phenomena (2).

¹⁶ See https://www.sketchengine.co.uk/corpus-querying/

(2) 1:[] 2:[] & 1.c = 2.c

In a learner corpus, such queries make sense even for a single word token, e.g. for expressing identical or distinct values of the morphological case of the original form and of its corrected version (3).¹⁷

(3) 1:[] & 1.c != 1.c1

¹⁷ Unfortunately, queries including global conditions on dynamic attributes do not produce expected results in the present version of the Manatee search engine.

In a learner corpus, metadata about the author of the text are at least as important as all other types of annotation. For the number of texts authored by students according to their first language and the CEFR proficiency level in Czech see Table 4 below. The language group abbreviations read as follows: IE = non-Slavic Indo-European, nIE = non-Indo-European, S = Slavic.

{| class="wikitable"
|+ Table 4: Number of texts by language group and proficiency level in CzeSL-SGT
! Level !! S !! IE !! nIE !! unknown !! Σ
|-
| A1 || 1783 || 199 || 622 || 5 || 2609
|-
| A1+ || 283 || 21 || 11 || 0 || 315
|-
| A2 || 1348 || 269 || 480 || 1 || 2098
|-
| A2+ || 403 || 54 || 113 || 0 || 570
|-
| B1 || 929 || 195 || 357 || 0 || 1481
|-
| B2 || 523 || 115 || 107 || 0 || 745
|-
| C1 || 82 || 17 || 24 || 0 || 123
|-
| C2 || 0 || 1 || 0 || 0 || 1
|-
| unknown || 291 || 27 || 33 || 324 || 675
|-
| Σ || 5642 || 898 || 1747 || 330 || 8617
|}

====3.3 CzeSL-man v. 1====

CzeSL-man v. 1 is a collection of manually annotated transcripts of essays of non-native speakers of Czech, written in 2009–2013, a total of 645 texts, including 298 doubly annotated texts. The texts contain 128 thousand word tokens, including 59 thousand doubly annotated tokens; for a comparison with CzeSL-SGT see Table 5.

{| class="wikitable"
|+ Table 5: CzeSL-man v. 1 and CzeSL-SGT compared
! !! CzeSL-SGT !! CzeSL-man v. 1
|-
| Texts || 8,600 || 645
|-
| Sentences || 111K || 11K
|-
| Words || 958K || 104K
|-
| Tokens || 1,148K || 128K
|-
| Different authors || 1,965 || 262
|-
| Different L1s || 54 || 32
|-
| Proficiency levels || A1–C2 || A1–C1
|-
| Women/Men || 5:3 || 3:2
|-
| Words per text || 100–200 || 100–200
|}

Tables 6 and 7 show the number of texts for each combination of CEFR level and language group in CzeSL-man v. 1.

{| class="wikitable"
|+ Table 6: Number of texts by language group and proficiency level in CzeSL-man v. 1
! Level !! S !! IE !! nIE !! unknown !! Σ
|-
| A1 || 49 || 6 || 4 || || 59
|-
| A1+ || || || 3 || || 3
|-
| A2 || 18 || 26 || 67 || || 111
|-
| A2+ || 81 || 9 || 59 || || 149
|-
| B1 || 123 || 26 || 30 || || 179
|-
| B2 || 102 || 11 || 15 || || 128
|-
| C1 || 10 || || 2 || || 12
|-
| unknown || || || || 4 || 4
|-
| Σ || 383 || 78 || 180 || 4 || 645
|}

In addition to the number of tokens for the same category, Table 8 also shows the frequency of errors of the dep type, i.e. valency errors in the broad sense, including errors in the number of complements and adjuncts or errors in their morphosyntactic expression. The rather frequent error type shows a considerable and expected decrease in higher proficiency levels.

CzeSL-man v. 1 is about to be released for download in the LINDAT repository and for on-line searching in https://kontext.korpus.cz. Some solutions to the problem of using a feature-rich corpus search engine, which is still not suited to the two-level annotation scheme of CzeSL-man, are presented in §4.

===4 Some issues and lessons learnt===

Several points can be made about some of the CzeSL releases, reflecting issues involved in the design, compilation and presentation of learner corpora.

We start with CzeSL-plain and its hand-annotated part CzeSL-man v. 0: (i) Both corpora include some ROMi texts, actually produced by native speakers of a dialect of Czech, rather than by non-native speakers of Czech. This is due to the original strategy of grouping texts by the way they are processed. This has been changed in later releases, where texts produced by non-native and native learners (the latter including speakers of the Romani ethnolect of Czech) are parts of distinct corpora. (ii) Neither
CzeSL-plain nor CzeSL-man v. 0 includes the full set of metadata, which were not available in the appropriate form and content at the time the two corpora were prepared and released. In CzeSL-plain, the texts are categorized into three groups: as essays, written either by non-native learners, or by speakers of the Romani ethnolect of Czech, and as theses written by non-native students. In CzeSL-man v. 0 there is no distinction available. (iii) Due to the uncertainty about the optimal way of representing the complex two-level manual annotation, the SeLaQ tool cannot display the two-level annotation graphically.

{| class="wikitable"
|+ Table 7: Number of doubly annotated texts by language group and proficiency level in CzeSL-man v. 1
! Level !! S !! IE !! nIE !! Σ
|-
| A1 || 37 || 2 || 1 || 40
|-
| A1+ || || || 3 || 3
|-
| A2 || 5 || 23 || 47 || 75
|-
| A2+ || 21 || 6 || 49 || 76
|-
| B1 || 20 || 23 || 28 || 71
|-
| B2 || 7 || 11 || 12 || 30
|-
| C1 || 1 || || 2 || 3
|-
| Σ || 91 || 65 || 142 || 298
|}

{| class="wikitable"
|+ Table 8: Number of tokens and valency errors by language group and proficiency level in CzeSL-man v. 1
! !! A1 !! A2 !! B1 !! B2 !! C1 !! Σ
|-
| IE tokens || 227 || 7,336 || 5,311 || 2,340 || 0 || 15,214
|-
| IE dep || 13 || 361 || 118 || 28 || 0 || 520
|-
| IE %dep || 5.73% || 4.92% || 2.22% || 1.20% || || 3.42%
|-
| nIE tokens || 439 || 17,640 || 7,606 || 4,219 || 760 || 30,664
|-
| nIE dep || 13 || 715 || 237 || 116 || 7 || 1,088
|-
| nIE %dep || 2.96% || 4.05% || 3.12% || 2.75% || 0.92% || 3.55%
|-
| S tokens || 6,434 || 16,939 || 27,226 || 22,173 || 4,761 || 77,533
|-
| S dep || 225 || 470 || 652 || 443 || 17 || 1,807
|-
| S %dep || 3.50% || 2.77% || 2.39% || 2.00% || 0.36% || 2.33%
|-
| Σ tokens || 7,100 || 41,915 || 40,143 || 28,732 || 5,521 || 123,411
|-
| Σ dep || 251 || 1,546 || 1,007 || 587 || 24 || 3,415
|-
| Σ %dep || 3.54% || 3.69% || 2.51% || 2.04% || 0.43% || 2.77%
|}

There is a strong demand for CzeSL-man to become available for on-line searches at the Czech National Corpus portal, even if some of the properties and information present in the corpus may get lost in the conversion to the format used by the corpus search tool, based on the single-level annotation of a string of tokens. However, the converted format might still retain enough annotation to be attractive and useful for most tasks. Instead of assigning the error-related annotation to word tokens, which makes the option to annotate strings of tokens, or even discontinuous strings, very difficult, errors and corrections can be treated as structural annotation, i.e. similarly to the markup for paragraphs, sentences, phrases or text chunks. Even the splitting and joining of words and word order corrections can then be expressed.

The Manatee corpus search engine, used in the Czech National Corpus, and its (No)Sketch Engine front end actually include support for learner corpora.¹⁸ The in-line annotation can even have embedded structures, which may be used at least for some cases of multi-layered annotation. Making CzeSL-man with most of the annotation available this way thus seems a real prospect.

¹⁸ See https://www.sketchengine.co.uk/learner-corpus-functionality/

====4.1 Corpus design and planning====

The target corpus may be intended for a group of users with specific research or practical needs, or for a wide audience of language acquisition experts, researchers or practitioners. In any case the goals should be realistic in order to avoid a mission ending before the goals are achieved.

====4.2 Text acquisition====

Some balance or at least representative proportions of text and learner categories are necessary or at least useful. Tables 4–7 show an opposite, opportunistic approach, driven by practical constraints, often justified by the unavailability of texts of a specific category.

====4.3 Transcription====

To avoid the need of cleaning transcripts with improperly used mark-up, an editing tool including strict format controls is preferable to a free-text editor.

====4.4 Annotation scheme and searching====

A scheme ideally suited to the data may turn into a problem later, if the consequences for the annotation process and the use of the corpus are not foreseen. Standard concordancers may require substantial tweaking of the data, while a custom-built tool may lack features of tools that have been developed over a long time. At the same time, most users of this type of corpora definitely need a friendly interface.

===5 Conclusion===

We have presented several releases of a learner corpus of Czech, available for on-line queries and under the Creative Commons license as full texts.

In order to reach its goals and become useful, a learner corpus project should be conceived carefully, considering many factors. By way of an example, we have shown some pitfalls in the process of building and presenting such a corpus.

The methods and tools developed within this project are not tied to the specific use and we hope they will be found useful in other projects.
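The suggestion in §4 that errors and corrections could be treated as structural annotation, rather than as attributes of single tokens, can be illustrated with a minimal sketch. It assumes a one-token-per-line ("vertical") text with XML-like structure tags; the element name err, its attributes type and corr, and the function to_vertical are hypothetical and are not the format actually used by Manatee or planned for CzeSL-man. The sample is the word-boundary error ne bude → nebude from Figure 1.

```python
def to_vertical(units):
    """Render a token stream in which corrections are structures, not token attributes.

    Each unit is either a plain token (str) or a tuple
    (original_tokens, corrected_tokens, error_label).
    """
    lines = []
    for unit in units:
        if isinstance(unit, str):
            lines.append(unit)
            continue
        original, corrected, label = unit
        corr = " ".join(corrected)
        # Hypothetical <err> structure wrapping the whole original span,
        # so that a 2:1 correction needs no per-token bookkeeping.
        lines.append(f'<err type="{label}" corr="{corr}">')
        lines.extend(original)
        lines.append("</err>")
    return "\n".join(lines)


# Fragment of the Tier 0 sentence in Figure 1: two tokens joined into one.
SAMPLE = ["že", "ona", "se", (["ne", "bude"], ["nebude"], "wbdPre")]
VERTICAL = to_vertical(SAMPLE)
print(VERTICAL)
```

Because the structure spans a contiguous string of tokens, this sketch covers joined and split words but not discontinuous strings, which would need a further convention.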
===Acknowledgements===

The corpus could never be built without many other members of the CzeSL team. For the work reported here the author is grateful especially to Barbora Štindlová, Jirka Hana and Tomáš Jelínek. The author’s thanks are also due to two anonymous reviewers who helped to improve the paper, and to the Grant Agency of the Czech Republic, which currently provides financial support for Non-native Czech from the Theoretical and Computational Perspective (project ID 16-10185S).

===References===

Boyd, A., Hana, J., Nicolas, L., Meurers, D., Wisniewski, K., Abel, A., Schöne, K., Štindlová, B., and Vettori, C. (2014). The MERLIN corpus: Learner language and the CEFR. In Calzolari, N., Choukri, K., Declerck, T., Loftsson, H., Maegaard, B., Mariani, J., Moreno, A., Odijk, J., and Piperidis, S., editors, Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC’14), Reykjavik, Iceland. European Language Resources Association (ELRA).

Hana, J., Rosen, A., Škodová, S., and Štindlová, B. (2010). Error-tagged learner corpus of Czech. In Proceedings of the Fourth Linguistic Annotation Workshop, Uppsala, Sweden. Association for Computational Linguistics.

Hana, J., Rosen, A., Štindlová, B., and Štěpánek, J. (2014). Building a learner corpus. Language Resources and Evaluation, 48(4):741–752.

Jelínek, T., Štindlová, B., Rosen, A., and Hana, J. (2012). Combining manual and automatic annotation of a learner corpus. In Sojka, P., Horák, A., Kopeček, I., and Pala, K., editors, Text, Speech and Dialogue – Proceedings of the 15th International Conference TSD 2012, number 7499 in Lecture Notes in Computer Science, pages 127–134. Springer.

Ramasamy, L., Rosen, A., and Straňák, P. (2015). Improvements to Korektor: A case study with native and non-native Czech. In Yaghob, J., editor, ITAT 2015: Information technologies – Applications and Theory / SloNLP 2015, pages 73–80, Prague. Charles University in Prague.

Richter, M., Straňák, P., and Rosen, A. (2012). Korektor – a system for contextual spell-checking and diacritics completion. In Proceedings of COLING 2012: Posters, pages 1019–1028, Mumbai, India. The COLING 2012 Organizing Committee.

Rosen, A., Hana, J., Štindlová, B., and Feldman, A. (2014). Evaluating and automating the annotation of a learner corpus. Language Resources and Evaluation – Special Issue: Resources for language learning, 48(1):65–92.

Wisniewski, K., Woldt, C., Schöne, K., Abel, A., Blaschitz, V., Štindlová, B., and Vodičková, K. (2014). The MERLIN annotation scheme for the annotation of German, Italian, and Czech learner language. Technical report. Available online http://merlin-platform.eu/.

Šebesta, K. (2010). Korpusy češtiny a osvojování jazyka [Corpora of Czech and language acquisition]. Studie z aplikované lingvistiky/Studies in Applied Linguistics, 1:11–34.

Štindlová, B. (2011a). Evaluace chybové anotace navržené pro žákovský korpus češtiny. SALi, 2(2):37–60.

Štindlová, B. (2011b). Evaluace chybové anotace v žákovském korpusu češtiny [Evaluation of Error Mark-Up in a Learner Corpus of Czech]. PhD thesis, Charles University, Faculty of Arts, Prague.

Štindlová, B., Rosen, A., Hana, J., and Škodová, S. (2012). CzeSL – an error tagged corpus of Czech as a second language. In Pęzik, P., editor, Corpus Data across Languages and Disciplines, volume 28 of Łódź Studies in Language, pages 21–32, Frankfurt am Main. Peter Lang.
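{| class="wikitable"
|+ Table 1: A list of learner corpora around the world
! Corpus !! Size (MW) !! L1 !! L2 !! Level !! Medium !! Annotation
|-
| ICLE || 3 || 26 || en || advanced || written || part
|-
| CLC || 35 || 130 || en || all || written || part
|-
| LINDSEI || 0.8 || 11 || en || advanced || spoken || part
|-
| PELCRA || 0.5 || pl || en || all || written || part
|-
| USE || 1.2 || sv || en || advanced || written || no
|-
| HKUST || 25 || zh || en || advanced || written || part
|-
| CHUNGDAHM || 131 || ko || en || all || written || part
|-
| JEFLL || 0.7 || jp || en || beginners || written || part
|-
| MELD || 1 || 16 || en || advanced || written || no
|-
| MICASE || 1.8 || various || en || advanced || spoken || no
|-
| NICT JLE || 2 || jp || en || all || spoken || part
|-
| RusLTC || 1.5 || ru || en || advanced || written || no
|-
| FALKO || 0.3 || 5 || de || advanced || written || part
|-
| FRIDA || 0.2 || various || fr || med-adv || spoken || part
|-
| FLLOC || 2 || en || fr || all || spoken || no
|-
| PiKUST || 0.04 || 18 || sl || advanced || written || yes
|-
| ASU || 0.5 || various || no || advanced || written || no
|-
| TUFS || 0.6 Mchars || various || jp || all || written || no
|}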
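{| class="wikitable"
|+ Table 2: Available releases of CzeSL
! Release !! Non-native essays !! Non-native theses !! Ethnolect !! TOTAL !! Annotation !! Metadata
|-
| CzeSL-plain || 1315 || 732 || 428 || 2475 || no || no
|-
| CzeSL-SGT || 1147 || || || 1147 || auto || yes
|-
| CzeSL-man v.0, a1 || 134 || || 192 || 326 || manual || no
|-
| CzeSL-man v.0, a2 || 59 || || 149 || 208 || manual || no
|-
| CzeSL-man v.1 || 134 || || || 134 || manual || yes
|}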
Tier 0: Bojal jsme že ona se ne bude libila slavnou prahu , proto to bylo velmí vadí pro mně .

Gloss: *feared aux that she rflx not will *like famous Prague , therefore it was *very resent for me .

Error labels on the links between Tier 0 and Tier 1: incorInfl wbdPre incorBase incorBase

Tier 1: Bál jsme že ona se nebude líbila slavnou Prahu , proto to bylo velmi vadí pro mně .

Error labels on the links between Tier 1 and Tier 2: agr rflx dep vbx agr,sec dep lex vbx dep

Tier 2: Bál jsem se , že se jí nebude líbit slavná Praha , protože to by mi velmi vadilo .

Translation: I was afraid that she would not like the famous city of Prague, because I would be very unhappy about it.

Figure 1: Two-level manual annotation of a sentence in CzeSL, the English glosses are added
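The m : n links between neighbouring tiers described in §3.1 can be pictured as a small data structure. The following is only a sketch: the class Link, its fields and the helper describe are hypothetical names, not the format of the annotation editor. The three links shown are those the text spells out (the joined word labelled wbdPre, the dep correction ona → jí with its pointer to líbit, and the 2:2 replacement of bylo ... vadí); the 2:2 link is left unlabelled because the text does not name its label.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

TIER0 = "Bojal jsme že ona se ne bude libila slavnou prahu , proto to bylo velmí vadí pro mně .".split()
TIER1 = "Bál jsme že ona se nebude líbila slavnou Prahu , proto to bylo velmi vadí pro mně .".split()
TIER2 = "Bál jsem se , že se jí nebude líbit slavná Praha , protože to by mi velmi vadilo .".split()


@dataclass(frozen=True)
class Link:
    """An edge joining m tokens on a lower tier to n tokens on the next tier."""
    lower: Tuple[int, ...]          # token indices on the lower tier
    upper: Tuple[int, ...]          # token indices on the upper tier
    label: str = ""                 # error label on the edge, if any
    pointer: Optional[int] = None   # upper-tier token motivating the correction


LINKS_0_1 = [
    # Word-boundary error: two tokens joined into one (2:1).
    Link((TIER0.index("ne"), TIER0.index("bude")), (TIER1.index("nebude"),), "wbdPre"),
]
LINKS_1_2 = [
    # Valency error: the nominative ona becomes the dative jí, governed by líbit.
    Link((TIER1.index("ona"),), (TIER2.index("jí"),), "dep", pointer=TIER2.index("líbit")),
    # Two non-contiguous finite verbs replaced in one step (2:2).
    Link((TIER1.index("bylo"), TIER1.index("vadí")), (TIER2.index("by"), TIER2.index("vadilo"))),
]


def describe(link, lower, upper):
    """Return a readable summary such as 'ne bude -> nebude [2:1] wbdPre'."""
    left = " ".join(lower[i] for i in link.lower)
    right = " ".join(upper[i] for i in link.upper)
    return f"{left} -> {right} [{len(link.lower)}:{len(link.upper)}] {link.label}".rstrip()


for link in LINKS_0_1:
    print(describe(link, TIER0, TIER1))
for link in LINKS_1_2:
    print(describe(link, TIER1, TIER2))
```

Storing token indices rather than word forms keeps split, joined and re-ordered tokens addressable, which is the property the two-tier scheme is meant to preserve.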
{| class="wikitable"
|+ Table 3: Annotation of a sample sentence in CzeSL-SGT
! word !! lemma !! tag !! word1 !! lemma1 !! tag1 !! gs !! err
|-
| Tén || Tén || X@------------- || Ten || ten || PDYS1---------- || S || Quant1
|-
| pes || pes || NNMS1-----A---- || pes || pes || NNMS1-----A---- || ||
|-
| míluje || míluje || X@------------- || miluje || milovat || VB-S---3P-AA--- || S || Quant1
|-
| svécho || svécho || X@------------- || svého || svůj || P8MS4---------- || S || Voiced
|-
| kamarada || kamarada || X@------------- || kamaráda || kamarád || NNMS4-----A---- || S || Quant0
|-
| - || - || Z:------------- || - || - || Z:------------- || ||
|-
| člověka || člověk || NNMS2-----A---- || člověka || člověk || NNMS4-----A---- || ||
|-
| . || . || Z:------------- || . || . || Z:------------- || ||
|}
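The token attributes of §3.2 and the "dynamic" attributes derived from tag positions can be sketched in a few lines. This is an illustration only: the class Token, the method dynamic and the function case_changed are hypothetical names, and the code merely emulates what query (3) asks for instead of calling any corpus search engine. The attribute names, the tag positions and the data rows are taken from the text and from Table 3.

```python
from dataclasses import dataclass

# 1-based tag positions behind the dynamic attributes listed in the text.
DYNAMIC_POSITIONS = {"k": 1, "s": 2, "g": 3, "n": 4, "c": 5, "p": 8}


@dataclass
class Token:
    word: str
    lemma: str
    tag: str
    word1: str
    lemma1: str
    tag1: str
    gs: str = ""
    err: str = ""

    def dynamic(self, name: str) -> str:
        """Return a dynamic attribute, e.g. 'c' (read from tag) or 'c1' (from tag1)."""
        corrected = name.endswith("1")
        base = name[:-1] if corrected else name
        tag = self.tag1 if corrected else self.tag
        return tag[DYNAMIC_POSITIONS[base] - 1]


def case_changed(token: Token) -> bool:
    """Emulate query (3), 1:[] & 1.c != 1.c1, for a single token."""
    return token.dynamic("c") != token.dynamic("c1")


# The rows of Table 3.
SENTENCE = [
    Token("Tén", "Tén", "X@-------------", "Ten", "ten", "PDYS1----------", "S", "Quant1"),
    Token("pes", "pes", "NNMS1-----A----", "pes", "pes", "NNMS1-----A----"),
    Token("míluje", "míluje", "X@-------------", "miluje", "milovat", "VB-S---3P-AA---", "S", "Quant1"),
    Token("svécho", "svécho", "X@-------------", "svého", "svůj", "P8MS4----------", "S", "Voiced"),
    Token("kamarada", "kamarada", "X@-------------", "kamaráda", "kamarád", "NNMS4-----A----", "S", "Quant0"),
    Token("-", "-", "Z:-------------", "-", "-", "Z:-------------"),
    Token("člověka", "člověk", "NNMS2-----A----", "člověka", "člověk", "NNMS4-----A----"),
    Token(".", ".", "Z:-------------", ".", ".", "Z:-------------"),
]

# Tokens whose case value differs between the original and the corrected tag.
CHANGED = [t.word for t in SENTENCE if case_changed(t)]
# The same condition restricted to forms the tagger recognized (k is not X).
CHANGED_RECOGNIZED = [t.word for t in SENTENCE
                      if t.dynamic("k") != "X" and case_changed(t)]
print(CHANGED, CHANGED_RECOGNIZED)
```

Unrecognized forms carry no case value in tag, so a bare comparison of c and c1 also returns every corrected non-word whose correction has a case; adding a condition on k is one possible way to narrow the result to recognized forms.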