<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Series</journal-title>
      </journal-title-group>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Building and using corpora of non-native Czech</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Alexandr Rosen</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Institute of Theoretical and Computational Linguistics, Faculty of Arts, Charles University in Prague</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2016</year>
      </pub-date>
      <volume>1649</volume>
      <fpage>80</fpage>
      <lpage>87</lpage>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>Investigating language acquisition by non-native learners
helps us to understand important linguistic issues and to develop
teaching methods better suited both to the specific target
language and to the learner. These tasks can now be based
on empirical evidence from learner corpora.</p>
      <p>A learner corpus consists of language produced by
language learners, typically learners of a second or foreign
language (L2). Such corpora may be equipped with
morphological and syntactic annotation, together with the
detection, correction and categorization of non-standard
linguistic phenomena.</p>
      <p>The tasks of designing, compiling, annotating and
presenting such corpora are often very much unlike those
routinely applied to standard corpora. There may be no
standard or obvious solutions: the approach to the tasks is
often seen as an answer to a specific research goal rather
than as a service to a wider community of researchers and
practitioners. Our aim is to investigate some of the
challenges, based on a learner corpus of Czech in comparison
to several other learner corpora.</p>
      <p>After an overview of learner corpora around the world
in §2 and a brief presentation of several releases of a
learner corpus of Czech in §3, we examine issues inherent
to the process of compiling, annotating and using such
corpora, including automatic identification of errors, the
design and application of error taxonomy, and a user-friendly
search tool, suited to a complex annotation (§4).</p>
    </sec>
    <sec id="sec-2">
      <title>About learner corpora</title>
      <p>Most existing learner corpora cover English as L2,
produced by students with various native languages (L1).
Most of the corpora are partially error-annotated,
see Table 1.<sup>1</sup> The error annotation is usually
inline, equivalent to XML tags, denoting the scope,
correction and categorization of an error. A few corpora such
as FALKO include multi-layered annotation in a tabular
format, with the option of specifying multiple target
hypotheses (corrections) and several error types for single
word tokens or strings thereof at different levels of
linguistic abstraction: orthography, morphology, syntax, lexicon,
pragmatics, intelligibility.</p>
      <p><sup>1</sup> For a more extensive overview see Štindlová (2011a) or an actively
maintained list at https://www.uclouvain.be/en-cecl-lcworld.html.</p>
      <p>The tabular format is also used in MERLIN, one of the
two currently available corpora including Czech.<sup>2</sup> In
addition to 64.5K words of Czech at CEFR levels A1–C1,
the corpus also includes German and Italian. It is tagged,
lemmatized, parsed and searchable on-line, with a detailed
error taxonomy and the option of two target hypotheses.</p>
    </sec>
    <sec id="sec-czesl">
      <title>CzeSL – the learner corpus of Czech as a Second Language</title>
      <p>CzeSL is a part of an umbrella project, the Acquisition
Corpora of Czech (AKCES), a research programme
pursued since 2005 (Šebesta, 2010). In addition to CzeSL,
AKCES has a written (SKRIPT) and a spoken (SCHOLA)
part collected from native Czech pupils, and ROMi, a part
collected from pupils with a Romani background, using the
Romani ethnolect of Czech as their first language (L1). In
the present paper we focus on written texts produced by
non-native learners of Czech. However, most of the
methods and tools can be applied to other parts of the corpus.</p>
      <p>CzeSL is focused on native speakers of three main
language groups: (1) Slavic, (2) other Indo-European, (3)
non-Indo-European. The hand-written texts cover all
language levels, from real beginners (A1) to advanced
learners (B2, C1, C2). The texts are equipped with metadata
records; some of them relate to the respondent (age,
gender, first language, proficiency in Czech, knowledge of
other languages, duration and conditions of language
acquisition), while others specify the character of the text and
the circumstances of its production (availability of reference
tools, type of elicitation, temporal and size restrictions,
etc.).</p>
      <p>
        The hand-written texts were transcribed using
off-the-shelf editors supporting HTML (e.g., Microsoft Word or
Open Office Writer). A set of codes was used to
capture variants, illegible strings and self-corrections; for details
see
        <xref ref-type="bibr" rid="ref10 ref11">(Štindlová, 2011b, p. 106ff)</xref>
        . During the
transcription step, the texts were anonymized by replacing personal
names with appropriate forms of Adam and Eva. Names
of smaller places (streets, villages, small towns) and other
potentially sensitive data were replaced by QQQ.
Unreadable characters or words were transcribed as XXX.
      </p>
      <p>
        The transcripts were converted into an XML format.
Some of them were corrected (‘emended’) and labelled
by error categories using a custom-built annotation
editor, supporting a two-layered annotation format with m : n
links between tokens at the neighbouring tiers.<sup>3</sup> In a
post-processing step the hand-annotated texts were tagged by
tools trained on native Czech in a way similar to
standard corpora, i.e. with lemmas and morphosyntactic categories,
and in some (currently non-public) releases of the corpus also
with syntactic functions and structure. Some error
annotation tasks were also done automatically: the assignment of
formal error labels and even the correction step (the latter
in CzeSL-SGT, see §3.2).
      </p>
      <p>
        <sup>2</sup> Multilingual Platform for European Reference Levels:
Interlanguage Exploration in Context, see http://merlin-platform.eu,
        <xref ref-type="bibr" rid="ref8">Wisniewski et al. (2014)</xref>
        and
        <xref ref-type="bibr" rid="ref1">Boyd et al. (2014)</xref>
        .
      </p>
      <p>There are several public releases of CzeSL, which
differ in the depth and method of annotation, and also in
size and the availability of metadata. Table 2 shows the
content of the available releases, including their volumes
(in thousands of tokens) and the availability of annotation
and metadata.<sup>4</sup></p>
      <sec id="sec-2-1">
        <title>Releases of CzeSL without metadata: CzeSL-plain and CzeSL-man v. 0</title>
        <p>Since 2012, the transcripts of essays hand-written by
non-native learners (1.3 mil. tokens) and pupils speaking the
Romani ethnolect of Czech (0.4 mil. tokens) have been
available together with some Bachelor and Master
theses written in Czech by foreign students (0.7 mil. tokens)
as the CzeSL-plain corpus, searchable on-line via the
web-based search interface of the Czech National Corpus,<sup>5</sup> or
as full texts under a Creative Commons license from
the LINDAT repository.<sup>6</sup> Except for specifying the three
groups above and a basic structural mark-up, this corpus
does not include any metadata or annotation.</p>
        <p>
          CzeSL-man v. 0 includes subsets of CzeSL and ROMi,
about 330 thousand tokens. It is manually error-annotated
at two levels. Texts of about 208 thousand tokens are
annotated independently by two annotators. Like CzeSL-plain,
the whole hand-annotated part is accessible online
without metadata, via a purpose-built search tool (SeLaQ);<sup>7</sup> for
more about the manual annotation and the annotation
process see
          <xref ref-type="bibr" rid="ref3">Hana et al. (2014)</xref>
          .
        </p>
        <p>The manual annotation scheme in CzeSL is based on
a two-stage annotation design, reflecting the distinction
roughly between errors in orthography and morphemics
on the one hand and all other error types on the other.
Tokens in the original transcript are linked with their
counterparts at the two successive levels by edges, possibly
labelled with the type of error (see Figure 1). A
syntactic error label may be linked by a pointer to a word
token, specifying an agreement, valency or referential
relation.<sup>8</sup> The level of transcribed input (Tier 0) is followed
by the level of orthographical and morphemic corrections
(Tier 1), where only forms incorrect in any context are
treated. Errors at Tier 1 are mainly non-word errors while
those at Tier 2 are real-word and grammatical errors.
However, a faulty form that happens to be spelled as a form
which would be correct in a different context is still
corrected at Tier 1. The result at Tier 1 is a string
consisting of correct Czech forms, even though the sentence may
not be correct as a whole. All other types of errors are
corrected at Tier 2, representing a grammatically correct,
though stylistically not necessarily optimal target
hypothesis.<sup>9</sup> Manual annotation is complemented by
morphosyntactic tags and lemmas at Tier 2, ambiguously specified
tags and lemmas at Tier 1, and automatically identified
formal errors.<sup>10</sup> Splitting, joining and reordering words,
together with the pointers, may make the picture rather
complex, as in the authentic sentence in Figure 1.</p>
        <p><sup>3</sup> https://bitbucket.org/jhana/feat</p>
        <p><sup>4</sup> Some texts in CzeSL-man v. 0 are doubly annotated. The texts
annotated by an additional annotator are included in the CzeSL-man v. 0, a2
part. See http://utkl.ff.cuni.cz/learncorp/ for links and more details.</p>
        <p><sup>5</sup> https://kontext.korpus.cz</p>
        <p><sup>6</sup> http://lindat.mff.cuni.cz</p>
        <p><sup>7</sup> http://chomsky.ruk.cuni.cz:5125</p>
        <p>The three tiers are represented as parallel strings of
word forms with links for corresponding forms. Tier 0
is glossed for readability; forms marked by asterisks are
incorrect in any context.</p>
        <p>Errors corrected at Tier 1 include incorrect
inflection (incorInfl), word boundaries (wbdPre), and stems
(incorBase). Errors in punctuation (the missing comma),
capitalization (prahu) or word order (se in the that-clause
at Tier 2) are tagged automatically in a post-processing
step.</p>
        <p>Tier 2 captures the remaining errors. Some error labels are
linked to a token which makes the reason for the
correction explicit. This includes errors in agreement (agr),
government or valency in a broad sense (dep), complex verb
forms (vbx) or reflexive particles (rflx). For example, ona
in the nominative case is governed by the form líbit se, and
should be in the dative case: jí. The label dep has an
arrow pointing to the governor líbit. There is also a simple
lexical correction: Proto ‘therefore’ is changed to protože
‘because’.</p>
        <p>However, the main issue is the two finite verbs bylo
and vadí. The most likely intention of the author is best
expressed by the conditional mood. The two non-contiguous
forms are replaced by the conditional auxiliary and the
content verb participle in one step using a 2:2 relation.
Another complex issue is the prepositional phrase pro mně
‘for me’. Its proper form is pro mě (homonymous with pro
mně, but with ‘me’ in accusative instead of dative), or pro
mne. The accusative case is required by the preposition
pro. However, the head verb requires that this
complement bear the bare dative – mi. Additionally, this form is a
clitic, following the conditional auxiliary.</p>
        <p><sup>8</sup> This scheme is already a compromise between a linear annotation
and an open multi-layered format, but a compromise preserving links
between split, joined and re-ordered tokens, corrected in two stages
simultaneously, something not obviously supported in the multi-layered tabular
format mentioned above in §2.</p>
        <p>
          <sup>9</sup> See
          <xref ref-type="bibr" rid="ref2">Hana et al. (2010)</xref>
          and
          <xref ref-type="bibr" rid="ref7">Rosen et al. (2014)</xref>
          for more details.
        </p>
        <p>
          <sup>10</sup> See
          <xref ref-type="bibr" rid="ref4">Jelínek et al. (2012)</xref>
          for details, including a list of formal error
types. The last column of Table 3 shows examples of the formal error
labels.
        </p>
        <p>The correction slavnou<sub>accusative</sub> → slavná<sub>nominative</sub> is due
to the correction of the case of the head noun. Such
corrections receive an additional label as secondary errors.</p>
      </sec>
      <sec id="sec-2-2">
        <title>The automatically annotated CzeSL-SGT</title>
        <p>The ‘real’ CzeSL, i.e. the corpus consisting of essays
written only by non-native learners (1.1 mil. tokens), is
available with automatic annotation as CzeSL-SGT,<sup>11</sup>
extending the “foreign” part of the CzeSL-plain corpus by texts
collected in 2013. This was the first release of CzeSL
including full metadata. The corpus includes 8,617 texts by
1,965 different authors with 54 different first languages.
The original transcription markup is discarded in this
corpus, while the author’s final version is restored. The
corpus is again available either for on-line searching using
the search interface of the Czech National Corpus or for
download from the LINDAT data repository.<sup>12</sup></p>
        <p>Word forms are tagged by word class, morphological
categories and base forms (lemmas). Some forms are
corrected by Korektor, a context-sensitive spelling/grammar
checker,<sup>13</sup> and the resulting texts are tagged and lemmatized again.
The original and corrected forms are then compared and “formal error
tags” are assigned on the basis of this comparison.<sup>14</sup>
Korektor detected and corrected 13.24% of the forms as
incorrect: 10.33% were labelled as including a spelling error
and 2.92% as including an error in grammar, i.e. a ‘real-word’ error.
The share of non-words
detected by the tagger is slightly lower – 9.23% (the tagger
uses a larger lexicon).</p>
        <p>Automatic correction is a crucial annotation step. The
tool is concerned mainly with errors in orthography and
morphemics, and handles some errors in morphosyntax,
including real-word errors (i.e. errors that produce a word
which seems to be correct out of context), as long as they
are detectable locally, within a reasonably small window
of n-grams. Corrections are limited to single words,
targeting a single character or a very small number of
characters by insertion, omission, substitution, transposition,
addition, deletion or substitution of a diacritic. Errors that
involve joining or splitting of word tokens or word-order
errors of any type are not handled at the moment.</p>
        <p><sup>11</sup> Czech as a Second Language with Spelling, Grammar and Tags</p>
        <p><sup>12</sup> http://hdl.handle.net/11234/1-162</p>
        <p>
          <sup>13</sup> See
          <xref ref-type="bibr" rid="ref6">Richter et al. (2012)</xref>
          . The tool is available from the LINDAT
repository (https://lindat.mff.cuni.cz) under the FreeBSD license.
        </p>
        <p>
          <sup>14</sup> See
          <xref ref-type="bibr" rid="ref4">Jelínek et al. (2012)</xref>
          .
        </p>
        <p>
          The performance of Korektor was evaluated first in
Štindlová et al. (2012) with about 20% error rate on the
set of non-words, and later in
          <xref ref-type="bibr" rid="ref5">Ramasamy et al. (2015)</xref>
          . In
an optimal setting of the model, the best results achieved
in terms of F1 score were 95.4% for error detection and
91.0% for error correction. In a manual analysis of 3000
tokens, about 23% of the tokens included an error: a form
error at Tier 1 (62% of the erroneous tokens), a grammar error at Tier 2 (27%),
or an accumulated error at both tiers (11%). Form errors
were detected with a success rate of 89%. For grammar
errors (real-word errors) the detection rate was much lower,
about 15.5%. The detection rate for accumulated errors was
similar to that for form errors (89%).
        </p>
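        <p>The per-type figures from the manual analysis of 3000 tokens can be combined into a single estimate. The following is a minimal sketch, assuming that the three shares (62%, 27%, 11%) partition the erroneous tokens and that an overall detection rate is simply their weighted average; this overall figure is an illustrative derivation, not a result reported in the evaluation, and the names used are hypothetical.</p>
        <preformat><![CDATA[
```python
# Shares of error types among the erroneous tokens and the detection
# rates per type, as quoted in the text for the manual analysis.
ERROR_SHARES = {"form": 0.62, "grammar": 0.27, "accumulated": 0.11}
DETECTION_RATES = {"form": 0.89, "grammar": 0.155, "accumulated": 0.89}


def overall_detection_rate(shares, rates):
    """Weighted average of the per-type detection rates (illustrative only)."""
    total_share = sum(shares.values())
    return sum(shares[kind] * rates[kind] for kind in shares) / total_share


overall = overall_detection_rate(ERROR_SHARES, DETECTION_RATES)
print(f"estimated overall detection rate: {overall:.1%}")
```
]]></preformat>
        <p>Under these assumptions roughly two thirds of the erroneous tokens would be detected, with the low rate for real-word errors accounting for most of the loss.</p>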
        <p>After all the automatic annotation steps are finished,
each token is labelled by the following attributes:</p>
        <list list-type="bullet">
          <list-item><p>word – original word form</p></list-item>
          <list-item><p>lemma – lemma of word; same as word if the form is not recognized</p></list-item>
          <list-item><p>tag – morphological tag of word; X@------------- if the form is not recognized</p></list-item>
          <list-item><p>word1 – corrected form; same as word if determined as correct</p></list-item>
          <list-item><p>lemma1 – lemma of word1</p></list-item>
          <list-item><p>tag1 – morphological tag of word1</p></list-item>
          <list-item><p>gs – information on whether the error was determined as a spelling (S) or grammar (G) error; for grammar errors, word is mostly recognized</p></list-item>
          <list-item><p>err – error type, determined by comparing word and word1</p></list-item>
        </list>
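        <p>The attribute set above can be pictured as one record per token. The sketch below is an illustration only: it assumes a whitespace-separated row with the eight attributes in the order listed (the last two empty when no error was found), which is a hypothetical layout and not necessarily the corpus distribution format. The class and function names are likewise invented for the example; the two sample rows are taken from the Table 3 example.</p>
        <preformat><![CDATA[
```python
from dataclasses import dataclass

UNKNOWN_TAG = "X@-------------"  # tag assigned to unrecognized forms


@dataclass
class Token:
    word: str      # original word form
    lemma: str     # lemma of word
    tag: str       # morphological tag of word
    word1: str     # corrected form
    lemma1: str    # lemma of word1
    tag1: str      # morphological tag of word1
    gs: str = ""   # "S" (spelling) or "G" (grammar); empty if no error
    err: str = ""  # formal error label; empty if no error

    @property
    def is_corrected(self) -> bool:
        return self.word != self.word1

    @property
    def is_recognized(self) -> bool:
        return self.tag != UNKNOWN_TAG


def parse_token(line: str) -> Token:
    """Parse one whitespace-separated row (hypothetical layout, 6 to 8 columns)."""
    fields = line.split()
    if not 6 <= len(fields) <= 8:
        raise ValueError(f"expected 6-8 columns, got {len(fields)}")
    return Token(*fields)


rows = [
    "Tén Tén X@------------- Ten ten PDYS1---------- S Quant1",
    "pes pes NNMS1-----A---- pes pes NNMS1-----A----",
]
tokens = [parse_token(row) for row in rows]
for tok in tokens:
    print(tok.word, "->", tok.word1, tok.gs or "-", tok.err or "-")
```
]]></preformat>
        <p>Keeping the original and the corrected triple side by side in one record makes comparisons such as word versus word1 a matter of a single attribute test.</p>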
        <p>In addition to the attributes listed above, the search
interface of the Czech National Corpus offers “dynamic”
attributes, derived from some positions of tag and tag1.
Dynamic attributes can be used in queries to specify
values of morphological categories without regular
expressions, to stipulate identity of these values in two or more
forms to require grammatical concord, or to compare
values of a category for word and word1. These attributes
are available for the following categories of the original
and the corrected form:</p>
        <list list-type="bullet">
          <list-item><p>k, k1 – word class (position 1 of the tag)</p></list-item>
          <list-item><p>s, s1 – detailed word class (position 2 of the tag)</p></list-item>
          <list-item><p>g, g1 – gender (position 3 of the tag)</p></list-item>
          <list-item><p>n, n1 – number (position 4 of the tag)</p></list-item>
          <list-item><p>c, c1 – case (position 5 of the tag)</p></list-item>
          <list-item><p>p, p1 – person (position 8 of the tag)</p></list-item>
        </list>
        <p><sup>15</sup> The example comes from a CzeSL-SGT text, written by a 17-year-old
student, with Russian as L1 and B2 as the proficiency level in Czech
(document ID ttt_G1_434).</p>
        <p>[Table residue, cell values not recoverable here. Row labels: Texts, Sentences, Words, Tokens, Different authors, Different L1s, Proficiency levels, Women/Men, Words per text. Column labels: nIE, unknown. CEFR levels: A1, A1+, A2, A2+, B1, B2, C1.]</p>
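        <p>How the dynamic attributes k, s, g, n, c and p relate to tag and tag1 can be sketched in a few lines. This is an illustration of the position mapping listed above, not the implementation used by the corpus search interface; the function names are hypothetical. The sample tags are those of člověka in the Table 3 example, where case differs between the original and the corrected form.</p>
        <preformat><![CDATA[
```python
# 1-based tag positions of the dynamic attributes, as listed in the text.
TAG_POSITIONS = {"k": 1, "s": 2, "g": 3, "n": 4, "c": 5, "p": 8}


def dynamic_attributes(tag: str, tag1: str) -> dict:
    """Derive k..p from tag and k1..p1 from tag1."""
    attrs = {}
    for name, pos in TAG_POSITIONS.items():
        attrs[name] = tag[pos - 1]
        attrs[name + "1"] = tag1[pos - 1]
    return attrs


def differs(attrs: dict, name: str) -> bool:
    """True if a category differs between the original and the corrected form."""
    return attrs[name] != attrs[name + "1"]


# Original form tagged as genitive (2), corrected form as accusative (4).
attrs = dynamic_attributes("NNMS2-----A----", "NNMS4-----A----")
print(attrs["c"], attrs["c1"], differs(attrs, "c"))
```
]]></preformat>
        <p>The call to differs(attrs, "c") mirrors, for a single token, the comparison of case values for word and word1 mentioned above.</p>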
        <p>They are meant especially for CQL queries<sup>16</sup> including
a “global condition”. As in standard corpora, such queries
target two or more word tokens with an arbitrary but equal
value of an attribute such as case to express grammatical
agreement and similar morphosyntactic phenomena (2).</p>
        <p>(2)</p>
        <p>In a learner corpus, such queries make sense even for a
single word token, e.g. for expressing identical or distinct
values of the morphological case of the original form and
of its corrected version (3).<sup>17</sup></p>
        <p>(3) <monospace>1:[] &amp; 1.c != 1.c1</monospace></p>
        <p>In a learner corpus, metadata about the author of the text
are at least as important as all other types of annotation.
For the number of texts authored by students according
to their first language and the CEFR proficiency level in
Czech see Table 4 below. The language group
abbreviations read as follows: IE = non-Slavic Indo-European, nIE
= non-Indo-European, S = Slavic.</p>
        <p>[Table 4, header residue only; cell values not recoverable: CEFR levels A1, A1+, A2, A2+, B1, B2, C1, C2, unknown.]</p>
        <p>CzeSL-man v. 1 is a collection of manually annotated
transcripts of essays by non-native speakers of Czech, written
in 2009–2013, a total of 645 texts, including 298 doubly
annotated texts. The texts contain 128 thousand word
tokens, including 59 thousand doubly annotated tokens; for
a comparison with CzeSL-SGT see Table 5.</p>
        <p>Tables 6 and 7 show the number of texts for each
combination of CEFR level and language group in CzeSL-man
v. 1.</p>
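        <p><sup>16</sup> See https://www.sketchengine.co.uk/corpus-querying/</p>
        <p><sup>17</sup> Unfortunately, queries including global conditions on dynamic
attributes do not produce expected results in the present version of the
Manatee search engine.</p>
        <p>[Table 5, surviving values for CzeSL-man v. 1: 645 texts, 11K sentences, 104K words, 128K tokens, 262 different authors, 32 different L1s.]</p>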
        <p>In addition to the number of tokens for each
category, Table 8 also shows the frequency of errors of the dep
type, i.e. valency errors in the broad sense, including
errors in the number of complements and adjuncts or errors
in their morphosyntactic expression. This rather frequent
error type shows a considerable, and expected, decrease at
higher proficiency levels.</p>
        <p>CzeSL-man v. 1 is about to be released for
download in the LINDAT repository and for on-line searching
at https://kontext.korpus.cz. Some solutions to the
problem of using a feature-rich corpus search engine, which
is still not suited to the two-level annotation scheme of
CzeSL-man, are presented in §4.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Some issues and lessons learnt</title>
      <p>Several points can be made about some of the CzeSL
releases, reflecting issues involved in the design,
compilation and presentation of learner corpora.</p>
      <p>We start with CzeSL-plain and its hand-annotated part
CzeSL-man v. 0: (i) Both corpora include some ROMi
texts, actually produced by native speakers of a dialect
of Czech, rather than by non-native speakers of Czech.
This is due to the original strategy of grouping texts by
the way they are processed. This has been changed in later
releases, where texts produced by non-native and native
learners (the latter including speakers of the Romani
ethnolect of Czech) are parts of distinct corpora. (ii) Neither
CzeSL-plain nor CzeSL-man v. 0 includes the full set of
metadata, which were not available in the appropriate form
and content at the time the two corpora were prepared and
released. In CzeSL-plain, the texts are categorized into
three groups: essays written by non-native
learners, essays written by speakers of the Romani ethnolect of Czech, and
theses written by non-native students. In CzeSL-man v. 0
no such distinction is available. (iii) Due to the uncertainty
about the optimal way of representing the complex
two-level manual annotation, the SeLaQ tool cannot display the
two-level annotation in a graphical format.</p>
      <p>[Table 8, header residue only: column groups IE, nIE, S and Σ, each with the columns dep and %dep.]</p>
      <p>There is a strong demand for CzeSL-man to become
available for on-line searches at the Czech National
Corpus portal, even if some of the properties and information
present in the corpus may get lost in the conversion to the
format used by the corpus search tool, which is based on the
single-level annotation of a string of tokens. However, the
converted format might still retain enough annotation to be
attractive and useful for most tasks. Assigning the
error-related annotation to word tokens makes it very
difficult to annotate strings of tokens, let alone discontinuous
strings. Instead, errors and corrections can be treated
as structural annotation, i.e. similarly to the markup for
paragraphs, sentences, phrases or text chunks. Even the
splitting and joining of words and word order corrections
can then be expressed.</p>
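        <p>The idea of treating an error as a structure around a span of tokens, rather than as an attribute of a single token, can be sketched as follows. This is a hypothetical illustration: the element name err and its attributes type and corr are invented for the example and are not the markup actually used for CzeSL-man, and the one-item-per-line output is only loosely modelled on the token-per-line input commonly used by corpus search tools. The example span is the two-token phrase pro mně, corrected to the single form mi as discussed in §3.1; the label dep is used for illustration.</p>
        <preformat><![CDATA[
```python
from xml.sax.saxutils import quoteattr


def wrap_error(original_tokens, corrected_tokens, error_type):
    """Wrap a (possibly multi-token) erroneous span in a hypothetical <err> structure."""
    attrs = "type={} corr={}".format(
        quoteattr(error_type), quoteattr(" ".join(corrected_tokens))
    )
    return ["<err {}>".format(attrs), *original_tokens, "</err>"]


def to_vertical(segments):
    """Flatten plain tokens and wrapped spans into one item per line."""
    lines = []
    for segment in segments:
        if isinstance(segment, str):
            lines.append(segment)    # an unannotated token
        else:
            lines.extend(segment)    # a wrapped span: open tag, tokens, close tag
    return "\n".join(lines)


segments = [
    wrap_error(["pro", "mně"], ["mi"], "dep"),  # two tokens corrected to one
    ".",
]
vertical = to_vertical(segments)
print(vertical)
```
]]></preformat>
        <p>Because the structure encloses the whole span, a correction that joins or splits words needs no one-to-one alignment between original and corrected tokens.</p>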
      <p>The Manatee corpus search engine, used in the Czech
National Corpus, and its (No)Sketch Engine front end
actually include support for learner corpora.<sup>18</sup> The in-line
annotation can even have embedded structures, which may
be used at least for some cases of multi-layered annotation.
Making CzeSL-man with most of the annotation available
this way thus seems a real prospect.</p>
      <p>The target corpus may be intended for a group of users
with specific research or practical needs, or for a wide
audience of language acquisition experts, researchers or
practitioners. In any case the goals should be realistic
in order to avoid a mission ending before the goals are
achieved.</p>
      <p>Some balance or at least representative proportions of text
and learner categories are necessary or at least useful.
Tables 4–7 show an opposite, opportunistic approach, driven
by practical constraints, often justified by the unavailability
of texts of a specific category.</p>
      <p>We have presented several releases of a learner corpus of
Czech, available for on-line queries and under the Creative
Commons license as full texts.</p>
      <p>In order to reach its goals and become useful, a learner
corpus project should be conceived carefully, considering
many factors. By way of an example, we have shown some
pitfalls in the process of building and presenting such a
corpus.</p>
      <p>The methods and tools developed within this project are
not tied to the specific use and we hope they will be found
useful in other projects.</p>
      <p><sup>18</sup> See https://www.sketchengine.co.uk/learner-corpus-functionality/</p>
      <sec id="sec-3-1">
        <title>Acknowledgements</title>
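        <p>The corpus could never have been built without many other
members of the CzeSL team. For the work reported here the
author is grateful especially to Barbora Štindlová, Jirka
Hana and Tomáš Jelínek. The author’s thanks are also due
to two anonymous reviewers who helped to improve the
paper, and to the Grant Agency of the Czech Republic,
which currently provides financial support for Non-native
Czech from the Theoretical and Computational
Perspective (project ID 16-10185S).
</p>
        <p>[Table 1, corpus names only; the remaining columns are not recoverable: ICLE, CLC, LINDSEI, PELCRA, USE, HKUST, CHUNGDAHM, JEFLL, MELD, MICASE, NICT JLE, RusLTC, FALKO, FRIDA, FLLOC, PiKUST, ASU, TUFS.]</p>
        <p>[Figure 1, fragment of Tier 0 with glosses: slavnou prahu , proto to bylo ‘famous Prague , therefore it was’.]</p>
        <table-wrap id="tab-3">
          <label>Table 3</label>
          <table>
            <thead>
              <tr><th>word</th><th>lemma</th><th>tag</th><th>word1</th><th>lemma1</th><th>tag1</th><th>gs</th><th>err</th></tr>
            </thead>
            <tbody>
              <tr><td>Tén</td><td>Tén</td><td>X@-------------</td><td>Ten</td><td>ten</td><td>PDYS1----------</td><td>S</td><td>Quant1</td></tr>
              <tr><td>pes</td><td>pes</td><td>NNMS1-----A----</td><td>pes</td><td>pes</td><td>NNMS1-----A----</td><td></td><td></td></tr>
              <tr><td>míluje</td><td>míluje</td><td>X@-------------</td><td>miluje</td><td>milovat</td><td>VB-S---3P-AA---</td><td>S</td><td>Quant1</td></tr>
              <tr><td>svécho</td><td>svécho</td><td>X@-------------</td><td>svého</td><td>svůj</td><td>P8MS4----------</td><td>S</td><td>Voiced</td></tr>
              <tr><td>kamarada</td><td>kamarada</td><td>X@-------------</td><td>kamaráda</td><td>kamarád</td><td>NNMS4-----A----</td><td>S</td><td>Quant0</td></tr>
              <tr><td>-</td><td>-</td><td>Z:-------------</td><td>-</td><td>-</td><td>Z:-------------</td><td></td><td></td></tr>
              <tr><td>člověka</td><td>člověk</td><td>NNMS2-----A----</td><td>člověka</td><td>člověk</td><td>NNMS4-----A----</td><td></td><td></td></tr>
              <tr><td>.</td><td>.</td><td>Z:-------------</td><td>.</td><td>.</td><td>Z:-------------</td><td></td><td></td></tr>
            </tbody>
          </table>
        </table-wrap>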
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <surname>Boyd</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hana</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nicolas</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Meurers</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wisniewski</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Abel</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schöne</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Štindlová</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Vettori</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          (
          <year>2014</year>
          ).
          <article-title>The MERLIN corpus: Learner language and the CEFR</article-title>
          . In Calzolari, N.,
          <string-name>
            <surname>Choukri</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Declerck</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Loftsson</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Maegaard</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mariani</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Moreno</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Odijk</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Piperidis</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          , editors,
          <source>Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)</source>
          , Reykjavik, Iceland.
          <publisher-name>European Language Resources Association (ELRA)</publisher-name>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <surname>Hana</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosen</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Škodová</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Štindlová</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          (
          <year>2010</year>
          ).
          <article-title>Error-tagged learner corpus of Czech</article-title>
          .
          <source>In Proceedings of the Fourth Linguistic Annotation Workshop</source>
          , Uppsala, Sweden. Association for Computational Linguistics.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <surname>Hana</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosen</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Štindlová</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Štěpánek</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          (
          <year>2014</year>
          ).
          <article-title>Building a learner corpus</article-title>
          .
          <source>Language Resources and Evaluation</source>
          ,
          <volume>48</volume>
          (
          <issue>4</issue>
          ):
          <fpage>741</fpage>
          -
          <lpage>752</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <surname>Jelínek</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Štindlová</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosen</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Hana</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          (
          <year>2012</year>
          ).
          <article-title>Combining manual and automatic annotation of a learner corpus</article-title>
          . In Sojka, P.,
          <string-name>
            <surname>Horák</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kopeček</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Pala</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          , editors,
          <source>Text, Speech and Dialogue - Proceedings of the 15th International Conference TSD 2012, number 7499 in Lecture Notes in Computer Science</source>
          , pages
          <fpage>127</fpage>
          -
          <lpage>134</lpage>
          . Springer.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <surname>Ramasamy</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosen</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Straňák</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          (
          <year>2015</year>
          ).
          <article-title>Improvements to Korektor: A case study with native and non-native Czech</article-title>
          . In Yaghob, J., editor,
          <source>ITAT 2015: Information technologies - Applications and Theory / SloNLP 2015</source>
          , pages
          <fpage>73</fpage>
          -
          <lpage>80</lpage>
          , Prague. Charles University in Prague.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <surname>Richter</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Straňák</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Rosen</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          (
          <year>2012</year>
          ).
          <article-title>Korektor - a system for contextual spell-checking and diacritics completion</article-title>
          . In
          <source>Proceedings of COLING 2012: Posters</source>
          , pages
          <fpage>1019</fpage>
          -
          <lpage>1028</lpage>
          , Mumbai, India.
          The COLING 2012 Organizing Committee.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <surname>Rosen</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hana</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Štindlová</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Feldman</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          (
          <year>2014</year>
          ).
          <article-title>Evaluating and automating the annotation of a learner corpus</article-title>
          .
          <source>Language Resources and Evaluation - Special Issue: Resources for language learning</source>
          ,
          <volume>48</volume>
          (
          <issue>1</issue>
          ):
          <fpage>65</fpage>
          -
          <lpage>92</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <surname>Wisniewski</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Woldt</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schöne</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Abel</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Blaschitz</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Štindlová</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Vodičková</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          (
          <year>2014</year>
          ).
          <article-title>The MERLIN annotation scheme for the annotation of German, Italian, and Czech learner language</article-title>
          .
          <source>Technical report</source>
          . Available online at http://merlin-platform.eu/.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <surname>Šebesta</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          (
          <year>2010</year>
          ).
          <article-title>Korpusy češtiny a osvojování jazyka [Corpora of Czech and language acquisition]</article-title>
          .
          <source>Studie z aplikované lingvistiky/Studies in Applied Linguistics</source>
          ,
          <volume>1</volume>
          :
          <fpage>11</fpage>
          -
          <lpage>34</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <surname>Štindlová</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          (
          <year>2011a</year>
          ).
          <article-title>Evaluace chybové anotace navržené pro žákovský korpus češtiny</article-title>
          .
          <source>SALi</source>
          ,
          <volume>2</volume>
          (
          <issue>2</issue>
          ):
          <fpage>37</fpage>
          -
          <lpage>60</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <string-name>
            <surname>Štindlová</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          (
          <year>2011b</year>
          ).
          <article-title>Evaluace chybové anotace v žákovském korpusu češtiny [Evaluation of Error MarkUp in a Learner Corpus of Czech]</article-title>
          .
          <source>PhD thesis</source>
          , Charles University, Faculty of Arts, Prague.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <string-name>
            <surname>Štindlová</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosen</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hana</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Škodová</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          (
          <year>2012</year>
          ).
          <article-title>CzeSL - an error tagged corpus of Czech as a second language</article-title>
          . In Pęzik, P., editor,
          <source>Corpus Data across Languages and Disciplines</source>
          , volume
          <volume>28</volume>
          of Łódź Studies in Language, pages
          <fpage>21</fpage>
          -
          <lpage>32</lpage>
          , Frankfurt am Main. Peter Lang.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>