<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Building and using corpora of non-native Czech</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Alexandr</forename><surname>Rosen</surname></persName>
							<affiliation key="aff0">
								<orgName type="department" key="dep1">Institute of Theoretical and Computational Linguistics</orgName>
								<orgName type="department" key="dep2">Faculty of Arts</orgName>
								<orgName type="institution">Charles University in</orgName>
								<address>
									<settlement>Prague</settlement>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Building and using corpora of non-native Czech</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">6AAC74D8F17DE50BDD2A4120576B9128</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T17:43+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<abstract/>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>Investigating language acquisition by non-native learners helps to understand important linguistic issues and develop teaching methods, better suited both to the specific target language and to the learner. These tasks can now be based on empirical evidence from learner corpora.</p><p>A learner corpus consists of language produced by language learners, typically learners of a second or foreign language (L2). Such corpora may be equipped with morphological and syntactic annotation, together with the detection, correction and categorization of non-standard linguistic phenomena.</p><p>The tasks of designing, compiling, annotating and presenting such corpora are often very much unlike those routinely applied to standard corpora. There may be no standard or obvious solutions: the approach to the tasks is often seen as an answer to a specific research goal rather than as a service to a wider community of researchers and practitioners. Our aim is to investigate some of the challenges, based on a learner corpus of Czech in comparison to several other learner corpora.</p><p>After an overview of learner corpora around the world in §2 and a brief presentation of several releases of a learner corpus of Czech in §3, we examine issues inherent to the process of compiling, annotating and using such corpora, including automatic identification of errors, the design and application of error taxonomy, and a user-friendly search tool, suited to a complex annotation ( §4).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">About learner corpora</head><p>Most existing learner corpora include English (L2) as produced by students with various native languages (L1). Most of the corpora are partially error-annotated; see Table <ref type="table" target="#tab_7">1</ref>.<ref type="bibr">1</ref> The error annotation is usually inline, equivalent to XML tags, denoting the scope, correction and categorization of an error. A few corpora such as FALKO include multi-layered annotation in a tabular format, with the option of specifying multiple target hypotheses (corrections) and several error types for single word tokens or strings thereof at different levels of linguistic abstraction: orthography, morphology, syntax, lexicon, pragmatics, intelligibility.</p><p>1 For a more extensive overview see Štindlová (2011a) or an actively maintained list at https://www.uclouvain.be/en-cecl-lcworld.html.</p><p>The tabular format is also used in MERLIN, one of the two currently available corpora including Czech. <ref type="bibr">2</ref> In addition to 64.5K words of Czech at CEFR levels A1-C1, the corpus also includes German and Italian. It is tagged, lemmatized, parsed and on-line searchable, with a detailed error taxonomy and the option of two target hypotheses.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">CzeSL - the learner corpus of Czech as a Second Language</head><p>CzeSL is a part of an umbrella project, the Acquisition Corpora of Czech (AKCES), a research programme pursued since 2005 (Šebesta, 2010). In addition to CzeSL, AKCES has a written part (SKRIPT) and a spoken part (SCHOLA) collected from native Czech pupils, and ROMi, a part collected from pupils with Romani background, using the Romani ethnolect of Czech as their first language (L1). In the present paper we focus on written texts produced by non-native learners of Czech. 
However, most of the methods and tools can be applied to other parts of the corpus.</p><p>CzeSL is focused on native speakers of three main language groups: (1) Slavic, (2) other Indo-European, (3) non-Indo-European. The hand-written texts cover all language levels, from real beginners (A1) to advanced learners (B2, C1, C2). The texts are equipped with metadata records; some of them relate to the respondent (age, gender, first language, proficiency in Czech, knowledge of other languages, duration and conditions of language acquisition), while others specify the character of the text and the circumstances of its production (availability of reference tools, type of elicitation, temporal and size restrictions etc.).</p><p>The hand-written texts were transcribed using off-the-shelf editors supporting HTML (e.g., Microsoft Word or Open Office Writer). A set of codes was used to capture variants, illegible strings and self-corrections; for details see Štindlová (2011b, p. 106ff). During the transcription step, the texts were anonymized by replacing personal names with appropriate forms of Adam and Eva. Names of smaller places (streets, villages, small towns) and other potentially sensitive data were replaced by QQQ. Unreadable characters or words were transcribed as XXX.</p><p>The transcripts were converted into an XML format. Some of them were corrected ('emended') and labelled by error categories using a custom-built annotation editor, supporting a two-layered annotation format with m:n links between tokens at the neighbouring tiers. <ref type="bibr">3</ref> In a post-processing step the hand-annotated texts were tagged by tools trained on native Czech in a way similar to standard corpora, i.e. with lemmas and morphosyntactic categories, and in some (currently non-public) releases of the corpus also with syntactic functions and structure. 
Some error annotation tasks were also done automatically: the assignment of formal error labels and even the correction step (the latter in CzeSL-SGT, see §3.2).</p><p>There are several public releases of CzeSL, which differ in the depth and method of annotation as well as in size and in the availability of metadata. Table <ref type="table" target="#tab_8">2</ref> shows the content of the available releases of CzeSL, including their volumes (in thousands of tokens) and the availability of annotation and metadata.<ref type="foot" target="#foot_2">4</ref> </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Releases of CzeSL without metadata: CzeSL-plain and CzeSL-man v. 0</head><p>Since 2012, the transcripts of essays hand-written by non-native learners (1.3 mil. tokens) and pupils speaking the Romani ethnolect of Czech (0.4 mil. tokens) have been available together with some Bachelor and Master theses written in Czech by foreign students (0.7 mil. tokens) as the CzeSL-plain corpus, on-line searchable via a web-based search interface of the Czech National Corpus,<ref type="foot" target="#foot_3">5</ref> or as full texts under the Creative Commons license from the LINDAT repository. <ref type="bibr">6</ref> Except for specifying the three groups above and a basic structural mark-up, this corpus does not include any metadata or annotation.</p><p>CzeSL-man v. 0 includes subsets of CzeSL and ROMi, about 330 thousand tokens. It is manually error-annotated at two levels. Texts amounting to about 208 thousand tokens are annotated independently by two annotators. Like CzeSL-plain, the whole hand-annotated part is accessible online without metadata via a purpose-built search tool (SeLaQ);<ref type="foot" target="#foot_5">7</ref> for more about the manual annotation and the annotation process see Hana et al. (2014).</p><p>The manual annotation scheme in CzeSL is based on a two-stage annotation design, reflecting the distinction roughly between errors in orthography and morphemics on the one hand and all other error types on the other. Tokens in the original transcript are linked with their counterparts at the two successive levels by edges, possibly labelled with the type of error; see Figure <ref type="figure" target="#fig_1">1</ref>. A syntactic error label may be linked by a pointer to a word token, specifying an agreement, valency or referential relation. 
<ref type="bibr">8</ref> The level of transcribed input (Tier 0) is followed by the level of orthographical and morphemic corrections (Tier 1), where only forms incorrect in any context are treated. Errors at Tier 1 are mainly non-word errors while those at Tier 2 are real-word and grammatical errors. However, a faulty form that happens to be spelled as a form which would be correct in a different context is still corrected at Tier 1. The result at Tier 1 is a string consisting of correct Czech forms, even though the sentence may not be correct as a whole. All other types of errors are corrected at Tier 2, representing a grammatically correct, though stylistically not necessarily optimal target hypothesis.<ref type="foot" target="#foot_7">9</ref> Manual annotation is complemented by morphosyntactic tags and lemmas at Tier 2, ambiguously specified tags and lemmas at Tier 1, and automatically identified formal errors.<ref type="foot" target="#foot_8">10</ref> Splitting, joining and reordering words, together with the pointers, may make the picture rather complex, as in the authentic sentence in Figure <ref type="figure" target="#fig_1">1</ref>. The three tiers are represented as parallel strings of word forms with links for corresponding forms. Tier 0 is glossed for readability; forms marked by asterisks are incorrect in any context.</p><p>Errors corrected at Tier 1 include incorrect inflection (incorInfl), word boundaries (wbdPre), and stems (incorBase). Errors in punctuation (the missing comma), capitalization (prahu) or word order (se in the that-clause at Tier 2) are tagged automatically in a post-processing step.</p><p>Tier 2 captures the remaining errors. Some error labels are linked to a token which makes the reason for the correction explicit. This includes errors in agreement (agr), government or valency in a broad sense (dep), complex verb forms (vbx) or reflexive particles (rflx). 
For example, ona in the nominative case is governed by the form líbit se, and should be in the dative case: jí. The label dep has an arrow pointing to the governor líbit. There is also a simple lexical correction: Proto 'therefore' is changed to protože 'because'.</p><p>However, the main issue is the two finite verbs bylo and vadí. The most likely intention of the author is best expressed by the conditional mood. The two non-contiguous forms are replaced by the conditional auxiliary and the content verb participle in one step using a 2:2 relation. Another complex issue is the prepositional phrase pro mně 'for me'. Its proper form is pro mě (homonymous with pro mně, but with 'me' in accusative instead of dative), or pro mne. The accusative case is required by the preposition pro. However, the head verb requires that this complement bear the bare dative, mi. Additionally, this form is a clitic, following the conditional auxiliary.</p><p>The correction slavnou (accusative) → slavná (nominative) is due to the correction of the case of the head noun. Such corrections receive an additional label as secondary errors.</p></div>
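<div xmlns="http://www.tei-c.org/ns/1.0"><p>The m:n links between neighbouring tiers described above can be sketched as a small data structure. The following is a minimal illustration in Python, not the format used by the CzeSL annotation editor: the Edge class, the corrections function and the index-based encoding are assumptions made for this example, and the assignment of labels to word forms follows one reading of the first clause of Figure 1.</p>
<p><![CDATA[
```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    """A link between token positions on two neighbouring tiers (hypothetical encoding)."""
    lower: tuple        # indices of tokens on the lower tier
    upper: tuple        # indices of tokens on the upper tier
    labels: tuple = ()  # error labels; empty if the form was not corrected


# First clause of the sentence in Figure 1, Tier 0 (transcript) and Tier 1.
TIER0 = ["Bojal", "jsme", "že", "ona", "se", "ne", "bude", "libila"]
TIER1 = ["Bál", "jsme", "že", "ona", "se", "nebude", "líbila"]

# Label placement is an assumption based on the layout of Figure 1.
EDGES = [
    Edge((0,), (0,), ("incorInfl",)),
    Edge((1,), (1,)),
    Edge((2,), (2,)),
    Edge((3,), (3,)),
    Edge((4,), (4,)),
    Edge((5, 6), (5,), ("wbdPre",)),   # 2:1 link: two tokens joined into one
    Edge((7,), (6,), ("incorBase",)),
]


def corrections(lower_tier, upper_tier, edges):
    """List (source, target, labels) for every labelled edge between two tiers."""
    found = []
    for edge in edges:
        if edge.labels:
            source = " ".join(lower_tier[i] for i in edge.lower)
            target = " ".join(upper_tier[i] for i in edge.upper)
            found.append((source, target, edge.labels))
    return found


for source, target, labels in corrections(TIER0, TIER1, EDGES):
    print(source, "->", target, labels)
```
]]></p>
<p>Because an edge stores tuples of indices on both sides, the same structure covers 1:1 corrections, the joining of ne bude into nebude, and in principle the 2:2 relation used for the conditional at Tier 2. Pointers from a label to a governing token would need an additional field, which is left out of this sketch.</p></div>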
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">The automatically annotated CzeSL-SGT</head><p>The 'real' CzeSL, i.e. the corpus consisting of essays written only by non-native learners (1.1 mil. tokens), is available with automatic annotation as CzeSL-SGT,<ref type="bibr">11</ref> extending the "foreign" part of the CzeSL-plain corpus by texts collected in 2013. This was the first release of CzeSL including full metadata. The corpus includes 8,617 texts by 1,965 different authors with 54 different first languages. The original transcription markup is discarded in this corpus, while the author's final version is restored. The corpus is again available either for on-line searching using the search interface of the Czech National Corpus or for download from the LINDAT data repository. <ref type="bibr">12</ref> Word forms are tagged by word class, morphological categories and base forms (lemmas). Some forms are corrected by Korektor, a context-sensitive spelling/grammar checker, <ref type="bibr">13</ref> and the resulting texts are tagged again. Original and corrected forms are compared and error labels are assigned. Korektor detected and corrected 13.24% of the forms as incorrect: 10.33% were labelled as including a spelling error and 2.92% as including an error in grammar, i.e. a 'real-word' error. Both the original, uncorrected texts and their corrected version were tagged and lemmatized, and "formal error tags," based on the comparison of the uncorrected and corrected forms, were assigned. <ref type="bibr">14</ref> The share of non-words detected by the tagger is slightly lower, 9.23% (the tagger uses a larger lexicon).</p><p>Automatic correction is a crucial annotation step. The tool is concerned mainly with errors in orthography and morphemics, and handles some errors in morphosyntax, including real-word errors (i.e. errors that produce a word which seems to be correct out of context), as long as they are detectable locally, within a reasonably small window of n-grams. 
Corrections are limited to single words, targeting a single character or a very small number of characters by insertion, omission, substitution or transposition, or by the addition, deletion or substitution of a diacritic. Errors that involve joining or splitting of word tokens or word-order errors of any type are not handled at the moment.</p><p>11 Czech as a Second Language with Spelling, Grammar and Tags</p><p>12 http://hdl.handle.net/11234/1-162</p><p>13 See Richter et al. (2012). The tool is available from the LINDAT repository (https://lindat.mff.cuni.cz) under the FreeBSD license.</p><p>14 See Jelínek et al. (2012).</p><p>The performance of Korektor was evaluated first in Štindlová et al. (2012), with an error rate of about 20% on the set of non-words, and later in Ramasamy et al. (2015). In an optimal setting of the model, the best results achieved in terms of F1 score were 95.4% for error detection and 91.0% for error correction. In a manual analysis of 3000 tokens, about 23% of the tokens included either a form error at Tier 1 (62%), a grammar error at Tier 2 (27%), or an accumulated error at both tiers (11%). Form errors were detected with a success rate of 89%. For grammar errors (real-word errors) the detection rate was much lower, about 15.5%. The detection rate for accumulated errors was similar to that for form errors (89%). 
After all the automatic annotation steps are finished, each token is labelled with the following attributes:</p><p>• word: original word form</p><p>• lemma: lemma of word; same as word if the form is not recognized</p><p>• tag: morphological tag of word; X@------------- if the form is not recognized</p><p>• word1: corrected form; same as word if determined as correct</p><p>• lemma1: lemma of word1</p><p>• tag1: morphological tag of word1</p><p>• gs: information on whether the error was determined as a spelling (S) or grammar (G) error; for grammar errors, word is mostly recognized</p><p>• err: error type, determined by comparing word and word1.</p><p>Table <ref type="table">3</ref> shows the use of the annotation in a simple sentence (1). <ref type="bibr">15</ref></p><p>(1) Tén pes míluje svécho kamarada - člověka.</p><p>that dog loves self's friend - man</p><p>'That dog loves its friend - the man.'</p><p>In addition to the attributes listed above, the search interface of the Czech National Corpus offers "dynamic" attributes, derived from some positions of tag and tag1. Dynamic attributes can be used in queries to specify values of morphological categories without regular expressions, to stipulate identity of these values in two or more forms to require grammatical concord, or to compare values of a category for word and word1. These attributes are available for the following categories of the original and the corrected form:</p><p>• k, k1: word class (position 1 of the tag)</p><p>• s, s1: detailed word class (position 2 of the tag)</p><p>• g, g1: gender (position 3 of the tag)</p><p>• n, n1: number (position 4 of the tag)</p><p>• c, c1: case (position 5 of the tag)</p><p>• p, p1: person (position 8 of the tag)</p><p>They are meant especially for CQL queries<ref type="foot" target="#foot_10">16</ref> including a "global condition". 
As in standard corpora, such queries target two or more word tokens with an arbitrary but equal value of an attribute such as case, to express grammatical agreement and similar morphosyntactic phenomena (2).</p><p>(2)</p><formula xml:id="formula_0">1:[] 2:[] &amp; 1.c = 2.c</formula><p>In a learner corpus, such queries make sense even for a single word token, e.g. for expressing identical or distinct values of the morphological case of the original form and of its corrected version (3).<ref type="foot" target="#foot_11">17</ref> </p><p>(3)</p><formula xml:id="formula_1">1:[] &amp; 1.c != 1.c1</formula><p>In a learner corpus, metadata about the author of the text are at least as important as all other types of annotation.</p><p>For the number of texts authored by students according to their first language and the CEFR proficiency level in Czech see Table <ref type="table" target="#tab_0">4</ref>.</p><p>Tables <ref type="table" target="#tab_3">6</ref> and <ref type="table" target="#tab_5">7</ref> show the number of texts for each combination of CEFR level and language group in CzeSL-man v. 1. In addition to the number of tokens for the same categories, Table <ref type="table" target="#tab_6">8</ref> also shows the frequency of errors of the dep type, i.e. valency errors in the broad sense, including errors in the number of complements and adjuncts or errors in their morphosyntactic expression. This rather frequent error type shows a considerable and expected decrease at higher proficiency levels.</p><p>CzeSL-man v. 1 is about to be released for download from the LINDAT repository and for on-line searching at https://kontext.korpus.cz. Some solutions to the problem of using a feature-rich corpus search engine, which is still not suited to the two-level annotation scheme of CzeSL-man, are presented in §4.</p></div>
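<div xmlns="http://www.tei-c.org/ns/1.0"><p>The derivation of the dynamic attributes from tag positions, and the effect of query (3), can be emulated outside the search interface. The sketch below is a minimal Python illustration under stated assumptions: the function names are ours, the two tokens are hypothetical, and their 15-character tag strings are made up for the example rather than taken from CzeSL-SGT. Only the attribute names and tag positions come from the list above.</p>
<p><![CDATA[
```python
# 1-based tag positions of the dynamic attributes, as listed in the text.
POSITIONS = {"k": 1, "s": 2, "g": 3, "n": 4, "c": 5, "p": 8}


def dynamic_attrs(tag, suffix=""):
    """Map a positional tag to its dynamic attributes (k, s, g, n, c, p)."""
    return {name + suffix: tag[pos - 1] for name, pos in POSITIONS.items()}


def annotate(token):
    """Return a copy of the token extended with attributes of tag and tag1."""
    extended = dict(token)
    extended.update(dynamic_attrs(token["tag"]))
    extended.update(dynamic_attrs(token["tag1"], "1"))
    return extended


def case_changed(token):
    """Emulate query (3): case of the original form differs from the corrected one."""
    extended = annotate(token)
    return extended["c"] != extended["c1"]


# Hypothetical tokens; the tag values are invented for illustration only.
TOKENS = [
    {"word": "ona", "tag": "PPFS1--3-------",
     "word1": "jí", "tag1": "PPFS3--3-------"},
    {"word": "pes", "tag": "NNMS1-----A----",
     "word1": "pes", "tag1": "NNMS1-----A----"},
]

hits = [token["word"] for token in TOKENS if case_changed(token)]
print(hits)
```
]]></p>
<p>With these two tokens only ona is reported, because position 5 of its tag differs from position 5 of tag1. Note that an unrecognized form tagged X@------------- has the value - for case, so under this emulation it differs from any corrected form that carries a case value.</p></div>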
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Some issues and lessons learnt</head><p>Several points can be made about some of the CzeSL releases, reflecting issues involved in the design, compilation and presentation of learner corpora.</p><p>We start with CzeSL-plain and its hand-annotated part CzeSL-man v. 0: (i) Both corpora include some ROMi texts, actually produced by native speakers of a dialect of Czech, rather than by non-native speakers of Czech. This is due to the original strategy of grouping texts by the way they are processed. This has been changed in later releases, where texts produced by non-native and native learners (the latter including speakers of the Romani ethnolect of Czech) are parts of distinct corpora. (ii) Neither CzeSL-plain nor CzeSL-man v. 0 includes the full set of metadata, which were not available in the appropriate form and content at the time the two corpora were prepared and released. In CzeSL-plain, the texts are categorized into three groups: as essays, written either by non-native learners or by speakers of the Romani ethnolect of Czech, and as theses written by non-native students. In CzeSL-man v. 0 there is no distinction available. (iii) Due to the uncertainty about the optimal way of representing the complex two-level manual annotation, the SeLaQ tool cannot display the two-level annotation graphically.</p><p>There is a strong demand for CzeSL-man to become available for on-line searches at the Czech National Corpus portal, even if some of the properties and information present in the corpus may get lost in the conversion to the format used by the corpus search tool, based on the single-level annotation of a string of tokens. However, the converted format might still retain enough annotation to be attractive and useful for most tasks. 
Instead of assigning the error-related annotation to word tokens, which makes it very difficult to annotate strings of tokens, let alone discontinuous strings, errors and corrections can be treated as structural annotation, i.e. similarly to the markup for paragraphs, sentences, phrases or text chunks. Even the splitting and joining of words and word order corrections can then be expressed.</p><p>The Manatee corpus search engine, used in the Czech National Corpus, and its (No)Sketch Engine front end actually include support for learner corpora.<ref type="bibr">18</ref> The in-line annotation can even have embedded structures, which may be used at least for some cases of multi-layered annotation. Making CzeSL-man with most of the annotation available this way thus seems a real prospect.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">Corpus design and planning</head><p>The target corpus may be intended for a group of users with specific research or practical needs, or for a wide audience of language acquisition experts, researchers or practitioners. In any case the goals should be realistic in order to avoid a mission ending before the goals are achieved.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">Text acquisition</head><p>Some balance or at least representative proportions of text and learner categories are necessary or at least useful. Tables 4-7 show an opposite, opportunistic approach, driven by practical constraints, often justified by the unavailability of texts of a specific category.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3">Transcription</head><p>To avoid the need of cleaning transcripts with improperly used mark-up, an editing tool including strict format controls is preferable to a free-text editor.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.4">Annotation scheme and searching</head><p>A scheme ideally suited to the data may turn into a problem later, if the consequences for the annotation process and the use of the corpus are not foreseen. Standard concordancers may require substantial tweaking of the data, while a custom-built tool may lack features of tools that have been developed over a long time. At the same time, most users of this type of corpora definitely need a friendly interface.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Conclusion</head><p>We have presented several releases of a learner corpus of Czech, available for on-line queries and under the Creative Commons license as full texts.</p><p>In order to reach its goals and become useful, a learner corpus project should be conceived carefully, considering many factors. By way of an example, we have shown some pitfalls in the process of building and presenting such a corpus.</p><p>The methods and tools developed within this project are not tied to the specific use and we hope they will be found useful in other projects.   </p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head></head><label></label><figDesc>be very unhappy about it.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: Two-level manual annotation of a sentence in CzeSL, the English glosses are added</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 4</head><label>4</label><figDesc>Number of texts by language group and proficiency level in CzeSL-SGT</figDesc><table><row><cell></cell><cell>S</cell><cell>IE</cell><cell>nIE</cell><cell>unknown</cell><cell>Σ</cell></row><row><cell>A1</cell><cell>1783</cell><cell>199</cell><cell>622</cell><cell>5</cell><cell>2609</cell></row><row><cell>A1+</cell><cell>283</cell><cell>21</cell><cell>11</cell><cell>0</cell><cell>315</cell></row><row><cell>A2</cell><cell>1348</cell><cell>269</cell><cell>480</cell><cell>1</cell><cell>2098</cell></row><row><cell>A2+</cell><cell>403</cell><cell>54</cell><cell>113</cell><cell>0</cell><cell>570</cell></row><row><cell>B1</cell><cell>929</cell><cell>195</cell><cell>357</cell><cell>0</cell><cell>1481</cell></row><row><cell>B2</cell><cell>523</cell><cell>115</cell><cell>107</cell><cell>0</cell><cell>745</cell></row><row><cell>C1</cell><cell>82</cell><cell>17</cell><cell>24</cell><cell>0</cell><cell>123</cell></row><row><cell>C2</cell><cell>0</cell><cell>1</cell><cell>0</cell><cell>0</cell><cell>1</cell></row><row><cell>unknown</cell><cell>291</cell><cell>27</cell><cell>33</cell><cell>324</cell><cell>675</cell></row><row><cell>Σ</cell><cell>5642</cell><cell>898</cell><cell>1747</cell><cell>330</cell><cell>8617</cell></row></table><p>The language group abbreviations read as follows: IE = non-Slavic Indo-European, nIE = non-Indo-European, S = Slavic.</p><p>3.3 CzeSL-man v. 1</p><p>CzeSL-man v. 1 is a collection of manually annotated transcripts of essays of non-native speakers of Czech, written in 2009-2013, the total of 645 texts, including 298 doubly annotated texts. The texts contain 128 thousand word tokens, including 59 thousand doubly annotated tokens; for a comparison with CzeSL-SGT see Table 5.</p></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 5 :</head><label>5</label><figDesc>CzeSL-man v. 1 and CzeSL-SGT compared</figDesc><table /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_3"><head>Table 6 :</head><label>6</label><figDesc>Number of texts by language group and proficiency level in CzeSL-man v. 1</figDesc><table><row><cell></cell><cell>S</cell><cell>IE</cell><cell>nIE</cell><cell>unknown</cell><cell>Σ</cell></row><row><cell>A1</cell><cell>49</cell><cell>6</cell><cell>4</cell><cell></cell><cell>59</cell></row><row><cell>A1+</cell><cell></cell><cell></cell><cell>3</cell><cell></cell><cell>3</cell></row><row><cell>A2</cell><cell>18</cell><cell>26</cell><cell>67</cell><cell></cell><cell>111</cell></row><row><cell>A2+</cell><cell>81</cell><cell>9</cell><cell>59</cell><cell></cell><cell>149</cell></row><row><cell>B1</cell><cell>123</cell><cell>26</cell><cell>30</cell><cell></cell><cell>179</cell></row><row><cell>B2</cell><cell>102</cell><cell>11</cell><cell>15</cell><cell></cell><cell>128</cell></row><row><cell>C1</cell><cell>10</cell><cell></cell><cell>2</cell><cell></cell><cell>12</cell></row><row><cell>unknown</cell><cell></cell><cell></cell><cell></cell><cell>4</cell><cell>4</cell></row><row><cell>Σ</cell><cell>383</cell><cell>78</cell><cell>180</cell><cell>4</cell><cell>645</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_5"><head>Table 7 :</head><label>7</label><figDesc>Number of doubly annotated texts by language group and proficiency level in CzeSL-man v. 1</figDesc><table /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_6"><head>Table 8 :</head><label>8</label><figDesc>Number of tokens and valency errors by language group and proficiency level in CzeSL-man v. 1</figDesc><table><row><cell></cell><cell>A1</cell><cell>A2</cell><cell>B1</cell><cell>B2</cell><cell>C1</cell><cell>Σ</cell></row><row><cell>IE</cell><cell>227</cell><cell>7,336</cell><cell>5,311</cell><cell>2,340</cell><cell>0</cell><cell>15,214</cell></row><row><cell>dep</cell><cell>13</cell><cell>361</cell><cell>118</cell><cell>28</cell><cell>0</cell><cell>520</cell></row><row><cell>%dep</cell><cell>5.73%</cell><cell>4.92%</cell><cell>2.22%</cell><cell>1.20%</cell><cell></cell><cell>3.42%</cell></row><row><cell>nIE</cell><cell>439</cell><cell>17,640</cell><cell>7,606</cell><cell>4,219</cell><cell>760</cell><cell>30,664</cell></row><row><cell>dep</cell><cell>13</cell><cell>715</cell><cell>237</cell><cell>116</cell><cell>7</cell><cell>1,088</cell></row><row><cell>%dep</cell><cell>2.96%</cell><cell>4.05%</cell><cell>3.12%</cell><cell>2.75%</cell><cell>0.92%</cell><cell>3.55%</cell></row><row><cell>S</cell><cell>6,434</cell><cell>16,939</cell><cell>27,226</cell><cell>22,173</cell><cell>4,761</cell><cell>77,533</cell></row><row><cell>dep</cell><cell>225</cell><cell>470</cell><cell>652</cell><cell>443</cell><cell>17</cell><cell>1,807</cell></row><row><cell>%dep</cell><cell>3.50%</cell><cell>2.77%</cell><cell>2.39%</cell><cell>2.00%</cell><cell>0.36%</cell><cell>2.33%</cell></row><row><cell>Σ</cell><cell>7,100</cell><cell>41,915</cell><cell>40,143</cell><cell>28,732</cell><cell>5,521</cell><cell>123,411</cell></row><row><cell>dep</cell><cell>251</cell><cell>1,546</cell><cell>1,007</cell><cell>587</cell><cell>24</cell><cell>3,415</cell></row><row><cell>%dep</cell><cell>3.54%</cell><cell>3.69%</cell><cell>2.51%</cell><cell>2.04%</cell><cell>0.43%</cell><cell>2.77%</cell></row></table></figure>
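<div xmlns="http://www.tei-c.org/ns/1.0"><p>The %dep values in the table of tokens and valency errors are the number of dep errors divided by the number of tokens. As a check on the reported decrease, the short Python sketch below recomputes the rates from the Σ rows of that table; the variable and function names are ours, and the figures are copied from the table.</p>
<p><![CDATA[
```python
# Totals per proficiency level, copied from the table (rows "Σ" and "dep").
LEVELS = ["A1", "A2", "B1", "B2", "C1"]
TOKENS = [7100, 41915, 40143, 28732, 5521]
DEP_ERRORS = [251, 1546, 1007, 587, 24]


def dep_rate(errors, tokens):
    """Share of dep errors per token, as a percentage."""
    return 100.0 * errors / tokens


rates = {
    level: dep_rate(errors, tokens)
    for level, errors, tokens in zip(LEVELS, DEP_ERRORS, TOKENS)
}

for level in LEVELS:
    print(f"{level}: {rates[level]:.2f}%")

# Rate over all levels together.
overall = dep_rate(sum(DEP_ERRORS), sum(TOKENS))
print(f"all: {overall:.2f}%")
```
]]></p>
<p>The recomputed values agree with the %dep row of the table. They also show that the decrease is not strictly monotonic: the rate at A2 is slightly higher than at A1 and falls steadily only from A2 onwards.</p></div>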
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_7"><head>Table 1 :</head><label>1</label><figDesc>A list of learner corpora around the world</figDesc><table><row><cell>Corpus</cell><cell>Size (MW)</cell><cell>L1</cell><cell>L2</cell><cell>Level</cell><cell>Medium</cell><cell>Annotation</cell></row><row><cell>ICLE</cell><cell>3</cell><cell>26</cell><cell>en</cell><cell>advanced</cell><cell>written</cell><cell>part</cell></row><row><cell>CLC</cell><cell>35</cell><cell>130</cell><cell>en</cell><cell>all</cell><cell>written</cell><cell>part</cell></row><row><cell>LINDSEI</cell><cell>0.8</cell><cell>11</cell><cell>en</cell><cell>advanced</cell><cell>spoken</cell><cell>part</cell></row><row><cell>PELCRA</cell><cell>0.5</cell><cell>pl</cell><cell>en</cell><cell>all</cell><cell>written</cell><cell>part</cell></row><row><cell>USE</cell><cell>1.2</cell><cell>sv</cell><cell>en</cell><cell>advanced</cell><cell>written</cell><cell>no</cell></row><row><cell>HKUST</cell><cell>25</cell><cell>zh</cell><cell>en</cell><cell>advanced</cell><cell>written</cell><cell>part</cell></row><row><cell>CHUNGDAHM</cell><cell>131</cell><cell>ko</cell><cell>en</cell><cell>all</cell><cell>written</cell><cell>part</cell></row><row><cell>JEFLL</cell><cell>0.7</cell><cell>jp</cell><cell>en</cell><cell>beginners</cell><cell>written</cell><cell>part</cell></row><row><cell>MELD</cell><cell>1</cell><cell>16</cell><cell>en</cell><cell>advanced</cell><cell>written</cell><cell>no</cell></row><row><cell>MICASE</cell><cell>1.8</cell><cell>various</cell><cell>en</cell><cell>advanced</cell><cell>spoken</cell><cell>no</cell></row><row><cell>NICT JLE</cell><cell>2</cell><cell>jp</cell><cell>en</cell><cell>all</cell><cell>spoken</cell><cell>part</cell></row><row><cell>RusLTC</cell><cell>1.5</cell><cell>ru</cell><cell>en</cell><cell>advanced</cell><cell>written</cell><cell>no</cell></row><row><cell>FALKO</cell><cell>0.3</cell><cell>5</cell><cell>de</cell><cell>advanced</cell><cell>written</cell><cell>part</cell></row><row><cell>FRIDA</cell><cell>0.2</cell><cell>various</cell><cell>fr</cell><cell>med-adv</cell><cell>spoken</cell><cell>part</cell></row><row><cell>FLLOC</cell><cell>2</cell><cell>en</cell><cell>fr</cell><cell>all</cell><cell>spoken</cell><cell>no</cell></row><row><cell>PiKUST</cell><cell>0.04</cell><cell>18</cell><cell>sl</cell><cell>advanced</cell><cell>written</cell><cell>yes</cell></row><row><cell>ASU</cell><cell>0.5</cell><cell>various</cell><cell>no</cell><cell>advanced</cell><cell>written</cell><cell>no</cell></row><row><cell>TUFS</cell><cell>0.6 Mchars</cell><cell>various</cell><cell>jp</cell><cell>all</cell><cell>written</cell><cell>no</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_8"><head>Table 2 :</head><label>2</label><figDesc>Available releases of CzeSL (sizes in thousands of tokens)</figDesc><table><row><cell></cell><cell>Non-native essays</cell><cell>Non-native theses</cell><cell>Ethnolect</cell><cell>TOTAL</cell><cell>Annotation</cell><cell>Metadata</cell></row><row><cell>CzeSL-plain</cell><cell>1315</cell><cell>732</cell><cell>428</cell><cell>2475</cell><cell>no</cell><cell>no</cell></row><row><cell>CzeSL-SGT</cell><cell>1147</cell><cell></cell><cell></cell><cell>1147</cell><cell>auto</cell><cell>yes</cell></row><row><cell>CzeSL-man v.0, a1</cell><cell>134</cell><cell></cell><cell>192</cell><cell>326</cell><cell>manual</cell><cell>no</cell></row><row><cell>CzeSL-man v.0, a2</cell><cell>59</cell><cell></cell><cell>149</cell><cell>208</cell><cell>manual</cell><cell>no</cell></row><row><cell>CzeSL-man v.1</cell><cell>134</cell><cell></cell><cell></cell><cell>134</cell><cell>manual</cell><cell>yes</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table"><head>Figure 1 (tiers)</head><figDesc>Word forms, glosses and error labels of the sentence in Figure 1, as recovered from the figure</figDesc><table><row><cell>Tier 0</cell><cell>Bojal jsme že ona se ne bude libila slavnou prahu , proto to bylo velmí vadí pro mně .</cell></row><row><cell>Gloss</cell><cell>*feared aux that she rflx not will *like famous Prague , therefore it was *very resent for me .</cell></row><row><cell>Labels at Tier 1</cell><cell>incorInfl, wbdPre, incorBase, incorBase</cell></row><row><cell>Tier 1</cell><cell>Bál jsme že ona se nebude líbila slavnou Prahu , proto to bylo velmi vadí pro mně .</cell></row><row><cell>Labels at Tier 2</cell><cell>agr, rflx, dep, vbx, agr,sec, dep</cell></row><row><cell>Tier 2</cell><cell>Bál jsem se , že se jí nebude líbit slavná Praha ,</cell></row><row><cell>Translation</cell><cell>I was afraid that she would not like the famous city of Prague,</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_8"><head>Table 2 :</head><label>2</label><figDesc>Available releases of CzeSL</figDesc><table><row><cell>Bojal *feared</cell><cell>jsme aux</cell><cell></cell><cell></cell><cell cols="4">že ona that she rflx not will se ne bude</cell><cell>libila *like</cell><cell cols="2">slavnou prahu , famous Prague ,</cell><cell>proto to bylo therefore it was</cell><cell>velmí *very</cell><cell>vadí pro mně . resent for me .</cell></row><row><cell>incorInfl</cell><cell></cell><cell></cell><cell></cell><cell></cell><cell></cell><cell></cell><cell cols="2">wbdPre incorBase</cell><cell></cell><cell>incorBase</cell></row><row><cell>Bál</cell><cell>jsme</cell><cell></cell><cell></cell><cell cols="3">že ona se</cell><cell>nebude</cell><cell>líbila</cell><cell cols="2">slavnou Prahu ,</cell><cell>proto to bylo</cell><cell>velmi</cell><cell>vadí pro mně .</cell></row><row><cell></cell><cell>agr</cell><cell>rflx</cell><cell></cell><cell></cell><cell>dep</cell><cell></cell><cell></cell><cell>vbx</cell><cell>agr,sec</cell><cell>dep</cell></row><row><cell>Bál</cell><cell>jsem</cell><cell>se</cell><cell>,</cell><cell>že</cell><cell>se</cell><cell>jí</cell><cell>nebude</cell><cell>líbit</cell><cell cols="2">slavná Praha ,</cell></row><row><cell cols="2">I was afraid</cell><cell></cell><cell></cell><cell></cell><cell cols="6">that she would not like the famous city of Prague,</cell></row></table></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_0">Multilingual Platform for European Reference Levels: Interlanguage Exploration in Context, see http://merlin-platform.eu and Wisniewski et al. (2014); Boyd et al. (2014)</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_1">https://bitbucket.org/jhana/feat</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_2">Some texts in CzeSL-man v.0 are doubly annotated. The texts annotated by an additional annotator are included in the CzeSL-man v.0, a2 part. See http://utkl.ff.cuni.cz/learncorp/ for links and more details.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="5" xml:id="foot_3">https://kontext.korpus.cz</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="6" xml:id="foot_4">http://lindat.mff.cuni.cz</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="7" xml:id="foot_5">http://chomsky.ruk.cuni.cz:5125</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="8" xml:id="foot_6">This scheme is already a compromise between a linear annotation and an open multi-layered format, but a compromise preserving links between split, joined and re-ordered tokens, corrected in two stages simultaneously, something not obviously supported in the multilayered tabular format mentioned above in §2.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="9" xml:id="foot_7">See <ref type="bibr" target="#b1">Hana et al. (2010)</ref> and <ref type="bibr" target="#b6">Rosen et al. (2014)</ref> for more details.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="10" xml:id="foot_8">See <ref type="bibr" target="#b3">Jelínek et al. (2012)</ref> for details, including a list of formal error types. The last column of Table 3 shows examples of the formal error labels.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="15" xml:id="foot_9">The example comes from a CzeSL-SGT text written by a 17-year-old student with Russian as L1 and B2 as the proficiency level in Czech (document ID ttt_G1_434).</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="16" xml:id="foot_10">See https://www.sketchengine.co.uk/corpus-querying/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="17" xml:id="foot_11">Unfortunately, queries including global conditions on dynamic attributes do not produce expected results in the present version of the Manatee search engine.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="18" xml:id="foot_13">See https://www.sketchengine.co.uk/learner-corpus-functionality/</note>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Acknowledgements</head><p>The corpus could never have been built without many other members of the CzeSL team. For the work reported here the author is especially grateful to Barbora Štindlová, Jirka Hana and Tomáš Jelínek. The author's thanks are also due to two anonymous reviewers who helped to improve the paper, and to the Grant Agency of the Czech Republic, which currently provides financial support for Non-native Czech from the Theoretical and Computational Perspective (project ID 16-10185S).</p></div>
			</div>

			<div type="annex">
<div xmlns="http://www.tei-c.org/ns/1.0" />			</div>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">The MERLIN corpus: Learner language and the CEFR</title>
		<author>
			<persName><forename type="first">A</forename><surname>Boyd</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Hana</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Nicolas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Meurers</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Wisniewski</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Abel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Schöne</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Štindlová</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Vettori</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC&apos;14)</title>
				<editor>
			<persName><forename type="first">N</forename><surname>Calzolari</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">K</forename><surname>Choukri</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">T</forename><surname>Declerck</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">H</forename><surname>Loftsson</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">B</forename><surname>Maegaard</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><surname>Mariani</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">A</forename><surname>Moreno</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><surname>Odijk</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">S</forename><surname>Piperidis</surname></persName>
		</editor>
		<meeting>the Ninth International Conference on Language Resources and Evaluation (LREC&apos;14)<address><addrLine>Reykjavik, Iceland</addrLine></address></meeting>
		<imprint>
			<publisher>ELRA</publisher>
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Error-tagged learner corpus of Czech</title>
		<author>
			<persName><forename type="first">J</forename><surname>Hana</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Rosen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Škodová</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Štindlová</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Fourth Linguistic Annotation Workshop</title>
				<meeting>the Fourth Linguistic Annotation Workshop<address><addrLine>Uppsala, Sweden</addrLine></address></meeting>
		<imprint>
			<publisher>Association for Computational Linguistics</publisher>
			<date type="published" when="2010">2010</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Building a learner corpus</title>
		<author>
			<persName><forename type="first">J</forename><surname>Hana</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Rosen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Štindlová</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Štěpánek</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Language Resources and Evaluation</title>
		<imprint>
			<biblScope unit="volume">48</biblScope>
			<biblScope unit="issue">4</biblScope>
			<biblScope unit="page" from="741" to="752" />
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Combining manual and automatic annotation of a learner corpus</title>
		<author>
			<persName><forename type="first">T</forename><surname>Jelínek</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Štindlová</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Rosen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Hana</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Text, Speech and Dialogue - Proceedings of the 15th International Conference TSD 2012</title>
		<title level="s">Lecture Notes in Computer Science</title>
		<editor>
			<persName><forename type="first">P</forename><surname>Sojka</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">A</forename><surname>Horák</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">I</forename><surname>Kopeček</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">K</forename><surname>Pala</surname></persName>
		</editor>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2012">2012</date>
			<biblScope unit="volume">7499</biblScope>
			<biblScope unit="page" from="127" to="134" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Improvements to Korektor: A case study with native and non-native Czech</title>
		<author>
			<persName><forename type="first">L</forename><surname>Ramasamy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Rosen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Straňák</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">ITAT 2015: Information technologies - Applications and Theory / SloNLP 2015</title>
				<editor>
			<persName><forename type="first">J</forename><surname>Yaghob</surname></persName>
		</editor>
		<meeting><address><addrLine>Prague</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2015">2015</date>
			<biblScope unit="page" from="73" to="80" />
		</imprint>
		<respStmt>
			<orgName>Charles University in Prague</orgName>
		</respStmt>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Korektor - a system for contextual spell-checking and diacritics completion</title>
		<author>
			<persName><forename type="first">M</forename><surname>Richter</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Straňák</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Rosen</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of COLING 2012: Posters</title>
				<meeting><address><addrLine>Mumbai, India</addrLine></address></meeting>
		<imprint>
			<publisher>The COLING 2012 Organizing Committee</publisher>
			<date type="published" when="2012">2012</date>
			<biblScope unit="page" from="1019" to="1028" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Evaluating and automating the annotation of a learner corpus</title>
		<author>
			<persName><forename type="first">A</forename><surname>Rosen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Hana</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Štindlová</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Feldman</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Language Resources and Evaluation - Special Issue: Resources for language learning</title>
		<imprint>
			<biblScope unit="volume">48</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page" from="65" to="92" />
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<monogr>
		<title level="m" type="main">The MERLIN annotation scheme for the annotation of German, Italian, and Czech learner language</title>
		<author>
			<persName><forename type="first">K</forename><surname>Wisniewski</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Woldt</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Schöne</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Abel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Blaschitz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Štindlová</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Vodičková</surname></persName>
		</author>
		<ptr target="http://merlin-platform.eu/" />
		<imprint>
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
	<note type="report_type">Technical report</note>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Korpusy češtiny a osvojování jazyka [Corpora of Czech and language acquisition]</title>
		<author>
			<persName><forename type="first">K</forename><surname>Šebesta</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Studie z aplikované lingvistiky/Studies in Applied Linguistics</title>
		<imprint>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="page" from="11" to="34" />
			<date type="published" when="2010">2010</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Evaluace chybové anotace navržené pro žákovský korpus češtiny</title>
		<author>
			<persName><forename type="first">B</forename><surname>Štindlová</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">SALi</title>
		<imprint>
			<biblScope unit="volume">2</biblScope>
			<biblScope unit="issue">2</biblScope>
			<biblScope unit="page" from="37" to="60" />
			<date type="published" when="2011">2011a</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<monogr>
		<title level="m" type="main">Evaluace chybové anotace v žákovském korpusu češtiny [Evaluation of Error Mark-Up in a Learner Corpus of Czech]</title>
		<author>
			<persName><forename type="first">B</forename><surname>Štindlová</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2011">2011b</date>
			<pubPlace>Prague</pubPlace>
		</imprint>
		<respStmt>
			<orgName>Charles University, Faculty of Arts</orgName>
		</respStmt>
	</monogr>
	<note type="report_type">PhD thesis</note>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">CzeSL - an error tagged corpus of Czech as a second language</title>
		<author>
			<persName><forename type="first">B</forename><surname>Štindlová</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Rosen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Hana</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Škodová</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Corpus Data across Languages and Disciplines</title>
		<title level="s">Łódź Studies in Language</title>
		<editor>
			<persName><forename type="first">P</forename><surname>Pęzik</surname></persName>
		</editor>
		<meeting><address><addrLine>Frankfurt am Main</addrLine></address></meeting>
		<imprint>
			<publisher>Peter Lang</publisher>
			<date type="published" when="2012">2012</date>
			<biblScope unit="volume">28</biblScope>
			<biblScope unit="page" from="21" to="32" />
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
