=Paper= {{Paper |id=Vol-1649/80 |storemode=property |title=Building and Using Corpora of Non-Native Czech |pdfUrl=https://ceur-ws.org/Vol-1649/80.pdf |volume=Vol-1649 |authors=Alexandr Rosen |dblpUrl=https://dblp.org/rec/conf/itat/Rosen16 }} ==Building and Using Corpora of Non-Native Czech== https://ceur-ws.org/Vol-1649/80.pdf
ITAT 2016 Proceedings, CEUR Workshop Proceedings Vol. 1649, pp. 80–87
http://ceur-ws.org/Vol-1649, Series ISSN 1613-0073, © 2016 A. Rosen



                                   Building and using corpora of non-native Czech

                                                                       Alexandr Rosen

                                         Institute of Theoretical and Computational Linguistics, Faculty of Arts
                                                               Charles University in Prague

1    Introduction

Investigating language acquisition by non-native learners helps to understand important linguistic issues and develop teaching methods, better suited both to the specific target language and to the learner. These tasks can now be based on empirical evidence from learner corpora.

A learner corpus consists of language produced by language learners, typically learners of a second or foreign language (L2). Such corpora may be equipped with morphological and syntactic annotation, together with the detection, correction and categorization of non-standard linguistic phenomena.

The tasks of designing, compiling, annotating and presenting such corpora are often very much unlike those routinely applied to standard corpora. There may be no standard or obvious solutions: the approach to the tasks is often seen as an answer to a specific research goal rather than as a service to a wider community of researchers and practitioners. Our aim is to investigate some of the challenges, based on a learner corpus of Czech in comparison to several other learner corpora.

After an overview of learner corpora around the world in §2 and a brief presentation of several releases of a learner corpus of Czech in §3, we examine issues inherent to the process of compiling, annotating and using such corpora, including automatic identification of errors, the design and application of error taxonomy, and a user-friendly search tool, suited to a complex annotation (§4).

2    About learner corpora

Most of the existing learner corpora include English (L2) as produced by students whose native languages (L1) are varied. Most of the corpora are partially error-annotated, see Table 1 on p. .¹ The error annotation is usually in-line, equivalent to XML tags, denoting the scope, correction and categorization of an error. A few corpora such as FALKO include multi-layered annotation in a tabular format, with the option of specifying multiple target hypotheses (corrections) and several error types for single word tokens or strings thereof at different levels of linguistic abstraction: orthography, morphology, syntax, lexicon, pragmatics, intelligibility.

The tabular format is also used in MERLIN, one of the two currently available corpora including Czech.² In addition to 64.5K words of Czech in CEFR levels A1–C1, the corpus includes also German and Italian. It is tagged, lemmatized, parsed and on-line searchable, with a detailed error taxonomy and the option of two target hypotheses.

3    CzeSL – the learner corpus of Czech as a Second Language

CzeSL is a part of an umbrella project, the Acquisition Corpora of Czech (AKCES), a research programme pursued since 2005 (Šebesta, 2010). In addition to CzeSL, AKCES has a written (SKRIPT) and spoken (SCHOLA) part collected from native Czech pupils, and ROMi, a part collected from pupils with Romani background, using the Romani ethnolect of Czech as their first language (L1). In the present paper we focus on written texts produced by non-native learners of Czech. However, most of the methods and tools can be applied to other parts of the corpus.

CzeSL is focused on native speakers of three main language groups: (1) Slavic, (2) other Indo-European, (3) non-Indo-European. The hand-written texts cover all language levels, from real beginners (A1) to advanced learners (B2, C1, C2). The texts are equipped with metadata records; some of them relate to the respondent (age, gender, first language, proficiency in Czech, knowledge of other languages, duration and conditions of language acquisition), while others specify the character of the text and circumstances of its production (availability of reference tools, type of elicitation, temporal and size restrictions etc.).

The hand-written texts were transcribed using off-the-shelf editors supporting HTML (e.g., Microsoft Word or Open Office Writer). A set of codes was used to capture variants, illegible strings, self-corrections; for details see (Štindlová, 2011b, p. 106ff). During the transcription step, the texts were anonymized by replacing personal names with appropriate forms of Adam and Eva. Names of smaller places (streets, villages, small towns) and other potentially sensitive data were replaced by QQQ. Unreadable characters or words were transcribed as XXX.

The transcripts were converted into an XML format. Some of them were corrected (‘emended’) and labelled

¹ For a more extensive overview see Štindlová (2011a) or an actively maintained list at https://www.uclouvain.be/en-cecl-lcworld.html.
² Multilingual Platform for European Reference Levels: Interlanguage Exploration in Context, see http://merlin-platform.eu and Wisniewski et al. (2014); Boyd et al. (2014)

by error categories using a custom-built annotation editor, supporting a two-layered annotation format with m:n links between tokens at the neighbouring tiers.³ In a post-processing step the hand-annotated texts were tagged by tools trained on native Czech in a way similar to standard corpora, i.e. by lemmas, morphosyntactic categories, in some (currently non-public) releases of the corpus also by syntactic functions and structure. Some error annotation tasks were also done automatically: the assignment of formal error labels and even the correction step (the latter in CzeSL-SGT, see §3.2).

There are several public releases of CzeSL, which differ in the depth and method of annotation, but also in the availability of metadata and size. Table 2 shows the content of available releases of CzeSL, including the volumes (in thousands of tokens), and the availability of annotation and metadata.⁴

3.1     Releases of CzeSL without metadata: CzeSL-plain and CzeSL-man v. 0

Since 2012, the transcripts of essays hand-written by non-native learners (1.3 mil. tokens) and pupils speaking the Romani ethnolect of Czech (0.4 mil. tokens) have been available together with some Bachelor and Master theses written in Czech by foreign students (0.7 mil. tokens) as the CzeSL-plain corpus, on-line searchable via a web-based search interface of the Czech National Corpus,⁵ or as full texts under the Creative Commons license from the LINDAT repository.⁶ Except for specifying the three groups above and a basic structural mark-up, this corpus does not include any metadata or annotation.

CzeSL-man v. 0 includes subsets of CzeSL and ROMi, about 330 thousand tokens. It is manually error-annotated at two levels. Texts of about 208 thousand tokens are annotated independently by two annotators. Like CzeSL-plain, the whole hand-annotated part is accessible online without metadata via a purpose-built search tool (SeLaQ);⁷ for more about the manual annotation and the annotation process see Hana et al. (2014).

The manual annotation scheme in CzeSL is based on a two-stage annotation design, reflecting the distinction roughly between errors in orthography and morphemics on the one hand and all other error types on the other. Tokens in the original transcript are linked with their counterparts at the two successive levels by edges, possibly labelled with the type of error – see Figure 1 on p. . A syntactic error label may be linked by a pointer to a word token, specifying an agreement, valency or referential relation.⁸ The level of transcribed input (Tier 0) is followed by the level of orthographical and morphemic corrections (Tier 1), where only forms incorrect in any context are treated. Errors at Tier 1 are mainly non-word errors while those at Tier 2 are real-word and grammatical errors. However, a faulty form that happens to be spelled as a form which would be correct in a different context, is still corrected at Tier 1. The result at Tier 1 is a string consisting of correct Czech forms, even though the sentence may not be correct as a whole. All other types of errors are corrected at Tier 2, representing a grammatically correct, though stylistically not necessarily optimal target hypothesis.⁹ Manual annotation is complemented by morphosyntactic tags and lemmas at Tier 2, ambiguously specified tags and lemmas at Tier 1, and automatically identified formal errors.¹⁰ Splitting, joining and reordering words, together with the pointers may make the picture rather complex, as in an authentic sentence in Figure 1 on p. .

The three tiers are represented as parallel strings of word forms with links for corresponding forms. Tier 0 is glossed for readability; forms marked by asterisks are incorrect in any context.

Errors corrected at Tier 1 include incorrect inflection (incorInfl), word boundaries (wbdPre), and stems (incorBase). Errors in punctuation (the missing comma), capitalization (prahu) or word order (se in the that-clause at Tier 2) are tagged automatically in a post-processing step.

Tier 2 captures the rest of the errors. Some error labels are linked to a token which makes the reason for the correction explicit. This includes errors in agreement (agr), government or valency in a broad sense (dep), complex verb forms (vbx) or reflexive particles (rflx). For example, ona in the nominative case is governed by the form líbit se, and should be in the dative case: jí. The label dep has an arrow pointing to the governor líbit. There is also a simple lexical correction: Proto ‘therefore’ is changed to protože ‘because’.

However, the main issue is the two finite verbs bylo and vadí. The most likely intention of the author is best expressed by the conditional mood. The two non-contiguous forms are replaced by the conditional auxiliary and the content verb participle in one step using a 2:2 relation. Another complex issue is the prepositional phrase pro mně ‘for me’. Its proper form is pro mě (homonymous with pro mně, but with ‘me’ in accusative instead of dative), or pro mne. The accusative case is required by the preposition pro. However, the head verb requires that this complement bears bare dative – mi. Additionally, this form is a

³ https://bitbucket.org/jhana/feat
⁴ Some texts in CzeSL-man v.0 are doubly annotated. The texts annotated by an additional annotator are included in the CzeSL-man v.0, a2 part. See http://utkl.ff.cuni.cz/learncorp/ for links and more details.
⁵ https://kontext.korpus.cz
⁶ http://lindat.mff.cuni.cz
⁷ http://chomsky.ruk.cuni.cz:5125
⁸ This scheme is already a compromise between a linear annotation and an open multi-layered format, but a compromise preserving links between split, joined and re-ordered tokens, corrected in two stages simultaneously, something not obviously supported in the multilayered tabular format mentioned above in §2.
⁹ See Hana et al. (2010) and Rosen et al. (2014) for more details.
¹⁰ See Jelínek et al. (2012) for details, including a list of formal error types. The last column of Table 3 shows examples of the formal error labels.

clitic, following the conditional auxiliary.

The correction slavnou (accusative) → slavná (nominative) is due to the correction of the case of the head noun. Such corrections receive an additional label as secondary errors.

3.2   The automatically annotated CzeSL-SGT

The ‘real’ CzeSL, i.e. the corpus consisting of essays written only by non-native learners (1.1 mil. tokens), is available with automatic annotation as CzeSL-SGT,¹¹ extending the “foreign” part of the CzeSL-plain corpus by texts collected in 2013. This was the first release of CzeSL including full metadata. The corpus includes 8,617 texts by 1,965 different authors with 54 different first languages. The original transcription markup is discarded in this corpus, while the final author’s version is restored. The corpus is available again either for on-line searching using the search interface of the Czech National Corpus or for download from the LINDAT data repository.¹²

Word forms are tagged by word class, morphological categories and base forms (lemmas). Some forms are corrected by Korektor, a context-sensitive spelling/grammar checker,¹³ and the resulting texts are tagged again. Original and corrected forms are compared and error labels are assigned. Korektor detected and corrected 13.24% incorrect forms, 10.33% labelled as including a spelling error, and 2.92% an error in grammar, i.e. a ‘real-word’ error. Both the original, uncorrected texts and their corrected version were tagged and lemmatized, and “formal error tags,” based on the comparison of the uncorrected and corrected forms, were assigned.¹⁴ The share of non-words detected by the tagger is slightly lower – 9.23% (the tagger uses a larger lexicon).

Automatic correction is a crucial annotation step. The tool is concerned mainly with errors in orthography and morphemics, and handles some errors in morphosyntax, including real-word errors (i.e. errors that produce a word which seems to be correct out of context), as long as they are detectable locally, within a reasonably small window of n-grams. Corrections are limited to single words, targeting a single character or a very small number of characters by insertion, omission, substitution, transposition, addition, deletion or substitution of a diacritic. Errors that involve joining or splitting of word tokens or word-order errors of any type are not handled at the moment.

The performance of Korektor was evaluated first in Štindlová et al. (2012) with about 20% error rate on the set of non-words, and later in Ramasamy et al. (2015). In an optimal setting of the model, the best results achieved in terms of F1 score were 95.4% for error detection and 91.0% for error correction. In a manual analysis of 3000 tokens, about 23% of the tokens included either a form error at Tier 1 (62%), a grammar error at Tier 2 (27%), or an accumulated error at both tiers (11%). Form errors were detected with a success rate of 89%. For grammar errors (real-word errors) the detection rate was much lower, about 15.5%. The detection of accumulated errors was similar to form errors (89%).

After all the automatic annotation steps are finished, each token is labelled by the following attributes:

   • word – original word form
   • lemma – lemma of word; same as word if the form is not recognized
   • tag – morphological tag of word; if the form is not recognized: X@-------------
   • word1 – corrected form; same as word if determined as correct
   • lemma1 – lemma of word1
   • tag1 – morphological tag of word1
   • gs – information on whether the error was determined as a spelling (S) or grammar (G) error; for grammar errors, word is mostly recognized
   • err – error type, determined by comparing word and word1.

Table 3 on p. shows the use of the annotation in a simple sentence (1).¹⁵

(1)      Tén  pes míluje svécho kamarada – člověka.
         that dog loves  self’s friend   – man
         ‘That dog loves its friend – the man.’

In addition to the attributes listed above, the search interface of the Czech National Corpus offers “dynamic” attributes, derived from some positions of tag and tag1. Dynamic attributes can be used in queries to specify values of morphological categories without regular expressions, to stipulate identity of these values in two or more forms to require grammatical concord, or to compare values of a category for word and word1. These attributes are available for the following categories of the original and the corrected form:

   • k, k1 – word class (position 1 of the tag)
   • s, s1 – detailed word class (position 2 of the tag)
   • g, g1 – gender (position 3 of the tag)
   • n, n1 – number (position 4 of the tag)
   • c, c1 – case (position 5 of the tag)

¹¹ Czech as a Second Language with Spelling, Grammar and Tags
¹² http://hdl.handle.net/11234/1-162
¹³ See Richter et al. (2012). The tool is available from the LINDAT repository (https://lindat.mff.cuni.cz) under the FreeBSD license.
¹⁴ See Jelínek et al. (2012).
¹⁵ The example comes from a CzeSL-SGT text, written by a 17-year-old student, with Russian as L1 and B2 as the proficiency level in Czech (document ID ttt_G1_434).

   • p, p1 – person (position 8 of the tag)

They are meant especially for CQL queries¹⁶ including a “global condition”. As in standard corpora, such queries target two or more word tokens with an arbitrary but equal value of an attribute such as case to express grammatical agreement and similar morphosyntactic phenomena (2).

(2)      1:[] 2:[] & 1.c = 2.c

In a learner corpus, such queries make sense even for a single word token, e.g. for expressing identical or distinct values of the morphological case of the original form and of its corrected version (3).¹⁷

(3)      1:[] & 1.c != 1.c1

In a learner corpus, metadata about the author of the text are at least as important as all other types of annotation. For the number of texts authored by students according to their first language and the CEFR proficiency level in Czech see Table 4 below. The language group abbreviations read as follows: IE = non-Slavic Indo-European, nIE = non-Indo-European, S = Slavic.

                 S     IE    nIE   unknown      Σ
   A1         1783    199    622         5   2609
   A1+         283     21     11         0    315
   A2         1348    269    480         1   2098
   A2+         403     54    113         0    570
   B1          929    195    357         0   1481
   B2          523    115    107         0    745
   C1           82     17     24         0    123
   C2            0      1      0         0      1
   unknown     291     27     33       324    675
   Σ          5642    898   1747       330   8617

Table 4: Number of texts by language group and proficiency level in CzeSL-SGT

3.3    CzeSL-man v. 1

CzeSL-man v. 1 is a collection of manually annotated transcripts of essays of non-native speakers of Czech, written in 2009–2013, the total of 645 texts, including 298 doubly annotated texts. The texts contain 128 thousand word tokens, including 59 thousand doubly annotated tokens; for a comparison with CzeSL-SGT see Table 5.

Tables 6 and 7 show the number of texts for each combination of CEFR level and language group in CzeSL-man v. 1.

                        CzeSL-SGT   CzeSL-man v. 1
   Texts                    8,600              645
   Sentences                 111K              11K
   Words                     958K             104K
   Tokens                  1,148K             128K
   Different authors        1,965              262
   Different L1s               54               32
   Proficiency levels       A1–C2            A1–C1
   Women/Men                  5:3              3:2
   Words per text         100–200          100–200

Table 5: CzeSL-man v. 1 and CzeSL-SGT compared

                 S     IE    nIE   unknown      Σ
   A1           49      6      4               59
   A1+                         3                3
   A2           18     26     67              111
   A2+          81      9     59              149
   B1          123     26     30              179
   B2          102     11     15              128
   C1           10             2               12
   unknown                               4      4
   Σ           383     78    180         4    645

Table 6: Number of texts by language group and proficiency level in CzeSL-man v. 1

In addition to the number of tokens for the same category, Table 8 shows also the frequency of errors of the dep type, i.e. valency errors in the broad sense, including errors in the number of complements and adjuncts or errors in their morphosyntactic expression. The rather frequent error type shows a considerable and expected decrease in higher proficiency levels.

CzeSL-man v. 1 is about to be released soon for download in the LINDAT repository and for on-line searching in https://kontext.korpus.cz. Some solutions to the problem of using a feature-rich corpus search engine, which is still not suited to the two-level annotation scheme of CzeSL-man, are presented in §4.

4    Some issues and lessons learnt

Several points can be made about some of the CzeSL releases, reflecting issues involved in the design, compilation and presentation of learner corpora.

We start with CzeSL-plain and its hand-annotated part CzeSL-man v. 0: (i) Both corpora include some ROMi texts, actually produced by native speakers of a dialect of Czech, rather than by non-native speakers of Czech. This is due to the original strategy of grouping texts by the way they are processed. This has been changed in later releases, where texts produced by non-native and native learners (the latter including speakers of the Romani ethnolect of Czech) are parts of distinct corpora. (ii) Neither

¹⁶ See https://www.sketchengine.co.uk/corpus-querying/
¹⁷ Unfortunately, queries including global conditions on dynamic attributes do not produce expected results in the present version of the Manatee search engine.
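
The dynamic attributes of §3.2 and the global conditions in queries (2) and (3) can also be emulated outside the corpus manager, for instance on a downloaded copy of the data. The following Python fragment is only a minimal sketch under stated assumptions: it assumes each token is available as a dictionary with the keys tag and tag1, the function names are hypothetical, and the three sample tokens with their 15-character tags are invented for illustration rather than taken from the corpus. Only the mapping of attribute names to tag positions follows the list given in the text.

```python
# 1-based tag positions read by the dynamic attributes, as listed in the text.
POSITIONS = {"k": 1, "s": 2, "g": 3, "n": 4, "c": 5, "p": 8}


def dynamic_attributes(token):
    """Derive k..p from token["tag"] and k1..p1 from token["tag1"]."""
    derived = {}
    for name, position in POSITIONS.items():
        derived[name] = token["tag"][position - 1]
        derived[name + "1"] = token["tag1"][position - 1]
    return derived


def case_changed(token):
    """Emulate query (3): 1:[] & 1.c != 1.c1."""
    attrs = dynamic_attributes(token)
    return attrs["c"] != attrs["c1"]


def same_case_pairs(tokens):
    """Emulate query (2): 1:[] 2:[] & 1.c = 2.c on the original forms."""
    cases = [dynamic_attributes(token)["c"] for token in tokens]
    return [(i, i + 1) for i in range(len(cases) - 1) if cases[i] == cases[i + 1]]


# Hypothetical tokens; the word forms and tags are illustrative only.
tokens = [
    {"word": "ona", "tag": "PPFS1--3-------", "word1": "jí", "tag1": "PPFS3--3-------"},
    {"word": "žena", "tag": "NNFS1-----A----", "word1": "žena", "tag1": "NNFS1-----A----"},
    {"word": "xyz", "tag": "X@-------------", "word1": "xyz", "tag1": "X@-------------"},
]

changed = [token["word"] for token in tokens if case_changed(token)]
pairs = same_case_pairs(tokens)
```

In this sketch, changed collects the tokens whose case differs between the original and the corrected form, and pairs holds the index pairs of adjacent tokens sharing the same case value. Unlike a real CQL query, the sketch treats the placeholder value of an unrecognized form as an ordinary value, which may or may not be desirable.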

                                    S    IE      nIE       Σ                          The Manatee corpus search engine, used in the Czech
                       A1       37        2       1        40                      National Corpus, and its (No)Sketch Engine front end
                       A1+                        3         3                      actually include support for learner corpora.¹⁸ The in-line
                       A2        5       23      47        75                      annotation can even have embedded structures, which may
                       A2+      21        6      49        76                      be used at least for some cases of multi-layered annotation.
                                                                                   Making CzeSL-man with most of the annotation available
                       B1       20       23      28        71
                                                                                   this way thus seems a real prospect.
                       B2        7       11      12        30
                       C1        1                2         3
                       Σ        91       65      142   298                         4.1    Corpus design and planning

     Table 7: Number of doubly annotated texts by language                         The target corpus may be intended for a group of users
     group and proficiency level in CzeSL-man v. 1                                 with specific research or practical needs, or for a wide
                                                                                   audience of language acquisition experts, researchers or
                                                                                   practitioners. In any case the goals should be realistic
                  A1           A2           B1        B2           C1          Σ
                                                                                   in order to avoid a mission ending before the goals are
      IE         227        7,336        5,311     2,340            0     15,214
                                                                                   achieved.
      dep         13          361          118        28            0        520
      %dep    5.73%        4.92%        2.22%     1.20%                   3.42%
      nIE        439       17,640        7,606     4,219           760    30,664   4.2    Text acquisition
      dep         13          715          237       116             7     1,088
      %dep    2.96%        4.05%        3.12%     2.75%         0.92%     3.55%    Some balance or at least representative proportions of text
      S        6,434       16,939       27,226    22,173         4,761    77,533   and learner categories are necessary or at least useful. Ta-
      dep        225          470          652       443            17     1,807   bles 4–7 show an opposite, opportunistic approach, driven
      %dep    3.50%        2.77%        2.39%     2.00%         0.36%     2.33%    by practical constraints, often justified by the unavailablity
      Σ        7,100       41,915       40,143    28,732         5,521   123,411   of texts of a specific category.
      dep        251        1,546        1,007       587            24     3,415
      %dep    3.54%        3.69%        2.51%     2.04%         0.43%     2.77%
                                                                                   4.3    Transcription
     Table 8: Number of tokens and valency errors by language
                                                                                   To avoid the need of cleaning transcripts with improperly
     group and proficiency level in CzeSL-man v. 1
                                                                                   used mark-up, an editing tool including strict format con-
                                                                                   trols is preferable to a free-text editor.
     CzeSL-plain nor CzeSL-man v. 0 includes the full set of
     metadata, which were not available in the appropriate form                    4.4    Annotation scheme and searching
     and content at the time the two corpora were prepared and
     released. In CzeSL-plain, the texts are categorized into                      A scheme ideally suited to the data may turn into a prob-
     three groups: as essays, written either by non-native learn-                  lem later, if the consequences for the annotation process
     ers, or by speakers of the Roma ethnolect of Czech, and as                    and the use of the corpus are not foreseen. Standard con-
     theses written by non-native students. In CzeSL-man v. 0                      cordancers may require substantial tweaking of the data,
     there is no distiction available. (iii) Due to the uncertainty                while a custom-built tool may lack features of the tools
     abouth the optimal way of representing the complex two-                       developed for a long time. At the same time, most users of
     level manual annotation, the SeLaQ tool cannot display the                    this type of corpora definitely need a friendly interface.
     two-level annotation format in a graphical format.
        There is a strong demand for CzeSL-man to become                           5     Conclusion
     available for on-line searches at the Czech National Cor-
     pus portal, even if some of the properties and information                    We have presented several releases of a learner corpus of
     present in the corpus may get lost in the conversion to the                   Czech, available for on-line queries and under the Creative
     format used by the corpus search tool, based on the single-                   Commons license as full texts.
     level annotation of a string of tokens. However, the con-                        In order to reach its goals and become useful, a learner
     verted format might still retain enough annotation to be at-                  corpus project should be conceived carefully, considering
     tractive and useful for most tasks. Instead of assigning the                  many factors. By way of an example, we have shown some
     error-related annotation to word tokens, which makes the                      pitfalls in the process of building and presenting such a
     option to annotate strings of tokens, or even discontinuous                   corpus.
     strings very difficult, errors and corrections can be treated                    The methods and tools developed within this project are
     as structural annotation, i.e. similarly to the markup for                    not tied to the specific use and we hope they will be found
     paragraphs, sentences, phrases or text chunks. Even the                       useful in other projects.
     splitting and joining of words and word order corrections
     can then be expressed.                                                            18 See https://www.sketchengine.co.uk/learner-corpus-functionality/

     Acknowledgements

     The corpus could never have been built without many other members of the CzeSL team. For the work
     reported here the author is grateful especially to Barbora Štindlová, Jirka Hana and Tomáš Jelínek.
     The author’s thanks are also due to two anonymous reviewers who helped to improve the paper, and to
     the Grant Agency of the Czech Republic, which currently provides financial support for Non-native
     Czech from the Theoretical and Computational Perspective (project ID 16-10185S).

     References

     Boyd, A., Hana, J., Nicolas, L., Meurers, D., Wisniewski, K., Abel, A., Schöne, K., Štindlová, B.,
       and Vettori, C. (2014). The MERLIN corpus: Learner language and the CEFR. In Calzolari, N.,
       Choukri, K., Declerck, T., Loftsson, H., Maegaard, B., Mariani, J., Moreno, A., Odijk, J., and
       Piperidis, S., editors, Proceedings of the Ninth International Conference on Language Resources
       and Evaluation (LREC’14), Reykjavik, Iceland. European Language Resources Association (ELRA).
     Hana, J., Rosen, A., Škodová, S., and Štindlová, B. (2010). Error-tagged learner corpus of Czech. In
       Proceedings of the Fourth Linguistic Annotation Workshop, Uppsala, Sweden. Association for
       Computational Linguistics.
     Hana, J., Rosen, A., Štindlová, B., and Štěpánek, J. (2014). Building a learner corpus. Language
       Resources and Evaluation, 48(4):741–752.
     Jelínek, T., Štindlová, B., Rosen, A., and Hana, J. (2012). Combining manual and automatic annotation
       of a learner corpus. In Sojka, P., Horák, A., Kopeček, I., and Pala, K., editors, Text, Speech and
       Dialogue – Proceedings of the 15th International Conference TSD 2012, number 7499 in Lecture Notes
       in Computer Science, pages 127–134. Springer.
     Ramasamy, L., Rosen, A., and Straňák, P. (2015). Improvements to Korektor: A case study with native
       and non-native Czech. In Yaghob, J., editor, ITAT 2015: Information technologies – Applications and
       Theory / SloNLP 2015, pages 73–80, Prague. Charles University in Prague.
     Richter, M., Straňák, P., and Rosen, A. (2012). Korektor – a system for contextual spell-checking and
       diacritics completion. In Proceedings of COLING 2012: Posters, pages 1019–1028, Mumbai, India. The
       COLING 2012 Organizing Committee.
     Rosen, A., Hana, J., Štindlová, B., and Feldman, A. (2014). Evaluating and automating the annotation
       of a learner corpus. Language Resources and Evaluation – Special Issue: Resources for language
       learning, 48(1):65–92.
     Wisniewski, K., Woldt, C., Schöne, K., Abel, A., Blaschitz, V., Štindlová, B., and Vodičková, K.
       (2014). The MERLIN annotation scheme for the annotation of German, Italian, and Czech learner
       language. Technical report. Available online at http://merlin-platform.eu/.
     Šebesta, K. (2010). Korpusy češtiny a osvojování jazyka [Corpora of Czech and language acquisition].
       Studie z aplikované lingvistiky/Studies in Applied Linguistics, 1:11–34.
     Štindlová, B. (2011a). Evaluace chybové anotace navržené pro žákovský korpus češtiny. SALi,
       2(2):37–60.
     Štindlová, B. (2011b). Evaluace chybové anotace v žákovském korpusu češtiny [Evaluation of Error
       Mark-Up in a Learner Corpus of Czech]. PhD thesis, Charles University, Faculty of Arts, Prague.
     Štindlová, B., Rosen, A., Hana, J., and Škodová, S. (2012). CzeSL – an error tagged corpus of Czech as
       a second language. In Pęzik, P., editor, Corpus Data across Languages and Disciplines, volume 28 of
       Łódź Studies in Language, pages 21–32, Frankfurt am Main. Peter Lang.




                       Corpus        Size (MW)     L1        L2    Level        Medium     Annotation
                       ICLE          3             26        en    advanced     written    part
                       CLC           35            130       en    all          written    part
                       LINDSEI       0.8           11        en    advanced     spoken     part
                       PELCRA        0.5           pl        en    all          written    part
                       USE           1.2           sv        en    advanced     written    no
                       HKUST         25            zh        en    advanced     written    part
                       CHUNGDAHM     131           ko        en    all          written    part
                       JEFLL         0.7           jp        en    beginners    written    part
                       MELD          1             16        en    advanced     written    no
                       MICASE        1.8           various   en    advanced     spoken     no
                       NICT JLE      2             jp        en    all          spoken     part
                       RusLTC        1.5           ru        en    advanced     written    no
                       FALKO         0.3           5         de    advanced     written    part
                       FRIDA         0.2           various   fr    med-adv      spoken     part
                       FLLOC         2             en        fr    all          spoken     no
                       PiKUST        0.04          18        sl    advanced     written    yes
                       ASU           0.5           various   no    advanced     written    no
                       TUFS          0.6 Mchars    various   jp    all          written    no

                                                  Table 1: A list of learner corpora around the world




                                                            Non-native
                                                       Essays        Theses            Ethnolect       TOTAL          Annotation         Metadata
                       CzeSL-plain                       1315         732                 428          2475               no                no
                       CzeSL-SGT                         1147                                          1147             auto                yes
                       CzeSL-man v.0, a1                   134                            192           326           manual                no
                       CzeSL-man v.0, a2                    59                            149           208           manual                no
                       CzeSL-man v.1                       134                                          134           manual                yes

                                                           Table 2: Available releases of CzeSL




      Bojal        jsme                  že ona   se   ne bude          libila        slavnou prahu ,         proto to bylo               velmí      vadí pro mně .
     *feared       aux                  that she rflx not will          *like         famous Prague ,       therefore it was              *very     resent for me .

     incorInfl                                              wbdPre    incorBase                                                         incorBase
       Bál         jsme                 že   ona    se      nebude      líbila        slavnou Prahu ,         proto     to bylo           velmi     vadí pro mně .
                   agr        rflx            dep                        vbx           agr,sec   dep           lex                vbx                        dep

       Bál         jsem       se    ,   že   se     jí      nebude      líbit         slavná    Praha ,       protože to by       mi      velmi     vadilo         .
               I was afraid                    that she would not like the famous city of Prague,                      because I would be very unhappy about it.


                 Figure 1: Two-level manual annotation of a sentence in CzeSL; English glosses are added
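
The two-level scheme of Figure 1 can be modelled as tiers of tokens connected by labelled links. The following Python sketch is only an illustration under stated assumptions: it covers a simplified fragment of the sentence (several tokens are omitted), the assignment of labels to links is read off the layout of the figure, and the names Link and changes are hypothetical rather than part of any CzeSL tool or file format.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class Link:
    """Hypothetical link between token positions on two adjacent tiers."""
    src: Tuple[int, ...]          # positions on the upper tier
    dst: Tuple[int, ...]          # positions on the lower tier
    labels: Tuple[str, ...] = ()  # error labels; empty if nothing changed


def changes(upper: Sequence[str], lower: Sequence[str],
            links: Sequence[Link]) -> List[Tuple[str, str, Tuple[str, ...]]]:
    """Return (source form, corrected form, labels) for every labelled link."""
    return [(" ".join(upper[i] for i in link.src),
             " ".join(lower[j] for j in link.dst),
             link.labels)
            for link in links if link.labels]


# Simplified fragment of the sentence in Figure 1 (assumption: tokens omitted).
tier0 = ["Bojal", "jsme", "ne", "bude", "libila"]  # learner's original text
tier1 = ["Bál", "jsme", "nebude", "líbila"]        # first-level corrections
tier2 = ["Bál", "jsem", "nebude", "líbit"]         # second-level corrections

links_0_1 = [
    Link((0,), (0,), ("incorInfl",)),
    Link((1,), (1,)),
    Link((2, 3), (2,), ("wbdPre",)),   # two source tokens joined into one
    Link((4,), (3,), ("incorBase",)),
]
links_1_2 = [
    Link((0,), (0,)),
    Link((1,), (1,), ("agr",)),
    Link((2,), (2,)),
    Link((3,), (3,), ("vbx",)),
]

level1 = changes(tier0, tier1, links_0_1)
level2 = changes(tier1, tier2, links_1_2)
```

A link with more than one source position, as for ne bude and nebude, is the kind of relation that an annotation attached to single word tokens cannot express directly.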




            word         lemma        tag                  word1       lemma1     tag1              gs   err
            Tén          Tén          X@-------------      Ten         ten        PDYS1----------   S    Quant1
            pes          pes          NNMS1-----A----      pes         pes        NNMS1-----A----
            míluje       míluje       X@-------------      miluje      milovat    VB-S---3P-AA---   S    Quant1
            svécho       svécho       X@-------------      svého       svůj       P8MS4----------   S    Voiced
            kamarada     kamarada     X@-------------      kamaráda    kamarád    NNMS4-----A----   S    Quant0
            -            -            Z:-------------      -           -          Z:-------------
            člověka      člověk       NNMS2-----A----      člověka     člověk     NNMS4-----A----
            .            .            Z:-------------      .           .          Z:-------------

                                       Table 3: Annotation of a sample sentence in CzeSL-SGT
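
The rows of Table 3 can be read as records with six obligatory fields (word, lemma, tag and their corrected counterparts word1, lemma1, tag1) and two optional ones (gs, err). The Python sketch below assumes whitespace-separated columns exactly as printed in the table; the actual distribution format of CzeSL-SGT may differ, and the names Token and parse_row are hypothetical.

```python
from typing import NamedTuple, Optional


class Token(NamedTuple):
    """One row of Table 3 (field names follow the table header)."""
    word: str
    lemma: str
    tag: str
    word1: str
    lemma1: str
    tag1: str
    gs: Optional[str] = None
    err: Optional[str] = None


def parse_row(line: str) -> Token:
    """Split a whitespace-separated row into a Token (assumed layout)."""
    fields = line.split()
    if len(fields) not in (6, 8):
        raise ValueError(f"expected 6 or 8 fields, got {len(fields)}: {line!r}")
    return Token(*fields)


# The sample sentence of Table 3, one token per line.
ROWS = """\
Tén Tén X@------------- Ten ten PDYS1---------- S Quant1
pes pes NNMS1-----A---- pes pes NNMS1-----A----
míluje míluje X@------------- miluje milovat VB-S---3P-AA--- S Quant1
svécho svécho X@------------- svého svůj P8MS4---------- S Voiced
kamarada kamarada X@------------- kamaráda kamarád NNMS4-----A---- S Quant0
- - Z:------------- - - Z:-------------
člověka člověk NNMS2-----A---- člověka člověk NNMS4-----A----
. . Z:------------- . . Z:-------------
"""

tokens = [parse_row(line) for line in ROWS.splitlines()]

# Forms that were changed by the automatic correction, with their error label.
corrected = [(t.word, t.word1, t.err) for t in tokens if t.word != t.word1]

# Forms left unchanged but tagged differently in the corrected context.
retagged = [(t.word, t.tag, t.tag1) for t in tokens
            if t.word == t.word1 and t.tag != t.tag1]
```

On this sample, corrected lists the four forms that carry an err label, while retagged contains only člověka, whose form is unchanged but whose tag differs between the two columns.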