J. Hlaváčová (Ed.): ITAT 2017 Proceedings, pp. 181–185
CEUR Workshop Proceedings Vol. 1885, ISSN 1613-0073, © 2017 T. Jelínek



                     FicTree: a Manually Annotated Treebank of Czech Fiction

                                                             Tomáš Jelínek

                                                   Charles University, Faculty of Arts,
                                                        Prague, Czech Republic
                                                    Tomas.Jelinek@ff.cuni.cz

Abstract: We present a manually annotated treebank of Czech fiction, intended to serve as an addendum to the Prague Dependency Treebank (PDT). The treebank has only 166,000 tokens, so it is not a good basis for training NLP tools on its own, but added to the PDT training data, it can help improve the annotation of fiction texts. We describe the composition of the corpus and the annotation process, including inter-annotator agreement. On the newly created data and the data of the PDT, we performed a number of experiments with parsers (TurboParser, Parsito, MSTParser and MaltParser). We observe that extending the PDT training data with a part of the new treebank does improve the results of parsing literary texts. We also investigate cases where all parsers agree on a syntactic annotation of a token which differs from the manual annotation.

1  Introduction

The Czech National Corpus (CNC) has decided to enrich the annotation of some of its large synchronic corpora with syntactic annotation, using the formalism of the Prague Dependency Treebank (PDT) [4]. The parsers used for the syntactic annotation must be trained on manually annotated data, and only the PDT data are available at present. To achieve reliable parsing, the training data should be as close as possible to the target texts; in the PDT, however, the texts are exclusively journalistic, while one third of the texts in the representative corpora of synchronic written Czech of the CNC belongs to the fiction genre. In many ways, fiction differs considerably from journalistic texts, for example in a significantly lower proportion of nouns versus verbs: in the journalistic genre, 33.8% of tokens are nouns and 16.0% are verbs; in fiction, the ratio of nouns to verbs is almost equal, with 24.3% of tokens being nouns and 21.2% verbs (based on statistics [1] from the SYN2005 corpus [3]).

Therefore, a new manually annotated treebank of fiction texts was created and annotated according to the PDT a-layer guidelines. Due to the difficulty of manual syntactic annotation, the new treebank amounts to only about 11% of the PDT data, but even so, using this new resource does improve the parsing of fiction texts.

In this article we present this new treebank, named FicTree (Treebank of Czech fiction), its composition, and the annotation process. We describe the first experiments with parsers based on the data of FicTree and the PDT. In the FicTree data parsed by four parsers, we investigate the cases where all parsers agree on a syntactic annotation of one token which differs from the manual annotation.

2  Composition of the Treebank

The manually annotated treebank FicTree is composed of eight texts and longer fragments of texts from the fiction genre published in Czech from 1991 to 2007, with a total of 166,437 tokens and 12,860 sentences. It is annotated according to the PDT a-layer annotation guidelines [5]. For comparison, the PDT data annotated on the analytical layer comprise 1,503,739 tokens and 87,913 sentences. Seven of the eight texts which compose the FicTree treebank were included in the CNC corpus SYN2010 [7]; the eighth one was originally intended for the SYN2010 corpus as well, but was removed in the balancing process. The size of the eight texts ranges from 4,000 to 32,000 tokens, with an average of 20,800 tokens. Most of the texts were originally written in Czech (80%); the remaining 20% are translations (from German and Slovak). Most of the texts belong to the fiction genre without any subgenre (according to the classification of the CNC); one large text (18.2% of all tokens) belongs to the subclass of memoirs, and 5.9% of tokens come from texts for children and youth.

The language data included in the PDT and in FicTree differ in many characteristics, in a way similar to the differences between the whole genres of journalism and fiction described above. FicTree has significantly shorter sentences, with an average of 12.9 tokens per sentence compared to 17.1 tokens per sentence in the PDT. The part-of-speech ratio is also significantly different, as shown in Table 1.

Table 1: POS proportion in PDT and FicTree (%)

                     PDT    FicTree
   Nouns           35.60      22.31
   Adjectives      13.72       7.73
   Pronouns         7.68      16.42
   Numerals         3.83       1.53
   Verbs           14.34      23.16
   Adverbs          6.18       9.19
   Prepositions    11.39       9.14
   Conjunctions     6.61       9.39
   Particles        0.64       1.05
   Interjections    0.01       0.07
   Total          100.00     100.00

It is evident from the table that FicTree has a significantly lower proportion of nouns, adjectives and numerals and a higher proportion of verbs, pronouns and adverbs, which corresponds to the assumption that fiction prefers verbal expressions, whereas journalism tends to use more nominal expressions.
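
For illustration, proportions of this kind can be computed with a few lines of Python. The sketch below assumes input in CoNLL-U format with the part of speech in the UPOS column; the file name is hypothetical, and the treebank data themselves are distributed in PDT formats, so this is not the script used for Table 1.

    from collections import Counter

    def pos_proportions(path):
        # Count the UPOS column (index 3) of ordinary token lines in a
        # CoNLL-U file; comment, multiword-token and empty-node lines
        # all fail the cols[0].isdigit() test and are skipped.
        counts = Counter()
        with open(path, encoding="utf-8") as f:
            for line in f:
                cols = line.rstrip("\n").split("\t")
                if len(cols) > 3 and cols[0].isdigit():
                    counts[cols[3]] += 1
        total = sum(counts.values())
        return {pos: round(100.0 * n / total, 2) for pos, n in counts.items()}

    # Hypothetical usage:
    # print(pos_proportions("fictree.conllu"))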

3  Annotation Procedure

The FicTree treebank was syntactically annotated according to the formalism of the analytical layer of the Prague Dependency Treebank. The texts were lemmatized and morphologically annotated using a hybrid system of rule-based disambiguation [6] and the stochastic tagger Featurama¹.
The texts were then doubly parsed using two parsers trained on the PDT a-layer training data, MSTParser [9] and MaltParser [10] (the parsing took place several years ago, when better parsers such as TurboParser [8] were not yet available). The difference in the algorithms of the two parsers ensured that their errors were distributed differently in the texts, so it can be assumed that the errors remaining after the subsequent manual corrections are not identical either. According to Berzak [2], some deviations are likely common to both parsers and will also manifest themselves in the final (manual) annotation, but this distortion of the data could not be avoided.

¹ See http://sourceforge.net/projects/featurama.

3.1  Manual Correction of Parsing Results

The automatically annotated data were then distributed to three annotators, who checked and corrected the data sentence by sentence using the TrEd software for manual treebank editing. The two versions of the parsed text (one parsed by the MSTParser, one by the MaltParser) were always assigned to two different annotators, and we ensured that the combinations of parsers and annotators were varied: the data were divided into 163 text parts of approx. 1,000 tokens, and every combination of parser and annotator occurred in at least 10 text parts (the proportions of text corrected by the individual annotators were 26%, 35% and 39%). The task of the manual annotators was to correct the syntactic structure and syntactic labels, but they could also suggest corrections of segmentation, tokenization, morphological annotation and lemmatization.

3.2  Adjudication

The two corrected versions of the syntactic annotation of each text were merged, and the resulting doubly annotated texts were examined by an experienced annotator (adjudicator) who decided which of the proposed annotations to accept. The adjudicator was not limited to the two manually corrected versions; she was allowed to choose another solution consistent with the PDT annotation manual and data. Some changes in tokenization and segmentation were also performed (159 cases, mainly sentence splits or merges). The adjudication took approximately five years of work, due to the difficulty of the task, the effort to maximize the consistency of the annotation of the same phenomenon across the treebank (and its accordance with the PDT data), and other workload with a higher priority.

3.3  Accuracy of the Parsing and of the Manual Corrections

In the following two tables, we present the accuracy of each step of the annotation and the inter-annotator agreement. Table 2 shows to what extent the automatically parsed and the manually corrected versions of the text agree with the final syntactic annotation, first for the texts annotated with the MSTParser, then for the ones annotated with the MaltParser. Two measures of agreement with the final annotation are shown: UAS (unlabeled attachment score, i.e. the proportion of tokens with a correct head) and LAS (labeled attachment score, i.e. the proportion of tokens with a correct head and dependency label).
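
Both measures reduce to simple token-level comparisons. The following sketch is our illustration, not the evaluation script actually used; it assumes a parse is represented in memory as a list of (head_index, label) pairs, one pair per token. The same computation, applied to two manually corrected versions instead of a parse and the gold data, yields the inter-annotator agreement reported in Table 3 below.

    def attachment_scores(gold, pred):
        # UAS/LAS sketch: gold and pred are equal-length lists of
        # (head_index, label) pairs, one pair per token.
        assert len(gold) == len(pred)
        n = len(gold)
        uas = sum(g[0] == p[0] for g, p in zip(gold, pred)) / n
        las = sum(g == p for g, p in zip(gold, pred)) / n
        return uas, las

    # Hypothetical three-token example: the first token's head differs,
    # the third token's label differs.
    gold = [(2, "Atr"), (0, "Pred"), (2, "Obj")]
    pred = [(3, "Atr"), (0, "Pred"), (2, "Adv")]
    print(attachment_scores(gold, pred))  # (0.666..., 0.333...)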

Table 2: Accuracy of the annotated versions (%)

           UAS:auto.   UAS:man.   LAS:auto.   LAS:man.
   MST         83.37      96.92       75.31      95.03
   Malt        86.08      96.40       79.39      94.42

It is clear from the table that, due to the relatively low quality of the input parsing, the annotators had to carry out a large number of manual interventions in the correction process: the dependencies or labels were modified for 15–20% of the tokens. The manually corrected versions differ much less from the final annotation; the disagreement is approx. 5% of the tokens.

Table 3 presents the agreement between the two automatically parsed versions and the inter-annotator agreement (the agreement between the two manually corrected versions). As in the previous table, we use the measures UAS and LAS.

Table 3: Agreement between parsers and inter-annotator agreement (%)

                  UAS     LAS
   Parsers      83.48   75.66
   Annotators   93.89   90.26

The table shows that the agreement between the automatically annotated versions is very similar to the agreement between the final annotation and the worse of the two parsing results.
After the manual corrections, the agreement between the two versions of the texts increased considerably, but their mutual difference is still approximately twice the difference between each of the manually corrected versions and the final syntactic annotation. This shows that the final annotation alternately used solutions from both versions of the texts.

4  Parsing Experiments

We conducted a series of experiments on the PDT and FicTree data. All data were automatically lemmatized and morphologically tagged using the MorphoDiTa tagger [12]². We used four parsers: two parsers of the older generation which had been used for the automatic annotation of the FicTree data before the manual corrections (here, however, with a different morphological annotation and with settings providing a better parsing accuracy), MSTParser [9]³ and MaltParser [10]⁴, and two newer parsers, TurboParser [8]⁵ and Parsito [11]⁶. We use three measures: UAS (unlabeled attachment score), LAS (labeled attachment score) and SENT (labeled attachment score for whole sentences, i.e. the proportion of sentences in which all tokens have correct heads and syntactic labels).

² Available at http://ufal.mff.cuni.cz/morphodita.
³ Available at https://sourceforge.net/projects/mstparser/; used with the parameters decode-type:non-proj, order:2.
⁴ Available at http://www.maltparser.org/; used with the stacklazy algorithm, the libsvm learner and a set of optimized features obtained with MaltOptimizer.
⁵ Available at http://www.cs.cmu.edu/~ark/TurboParser/; used with default options.
⁶ Available at https://ufal.mff.cuni.cz/parsito; used with hidden_layer=400, sgd=0.01,0.001, transition_system=link2, transition_oracle=static.
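
Unlike UAS and LAS, SENT is computed per sentence rather than per token. A minimal sketch under the same assumed (head, label) representation as above (an illustration, not the script actually used):

    def sent_score(gold_sents, pred_sents):
        # SENT sketch: parallel lists of sentences, each sentence a list of
        # (head, label) pairs; a sentence counts as correct only if every
        # one of its tokens has the correct head and the correct label.
        correct = sum(g == p for g, p in zip(gold_sents, pred_sents))
        return correct / len(gold_sents)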

4.1  Training on the PDT Data

The first experiment compared the parsing of the PDT test data (journalism) and of the whole FicTree data (fiction) using parsers trained on the PDT training data (journalism). The results are shown in Table 4; for each measure, two adjacent columns compare the results on the PDT etest data and on the whole FicTree data.

Table 4: Accuracy of parsers trained on PDT train data (%)

               UAS     UAS       LAS     LAS       SENT    SENT
              etest   FicTree   etest   FicTree   etest   FicTree
   MST        85.93    84.91    78.85    76.82    23.79    26.94
   Malt       86.32    85.01    80.74    77.94    31.32    31.86
   Parsito    86.30    84.62    80.78    77.65    31.17    31.32
   Turbo      88.27    86.66    81.79    79.06    27.74    29.61

The UAS and LAS scores of all parsers are approximately 2% worse for FicTree than for the PDT, probably due to the genre differences between FicTree and the PDT data. In the case of SENT, the FicTree scores are comparable to or better than those on the PDT etest, probably because the sentences in FicTree are significantly shorter, so a higher percentage of sentences is parsed entirely correctly.

4.2  Training on PDT Data Combined with FicTree

In the second experiment, we split the FicTree data into training data (90%) and test data (10%) and combined the FicTree training data with the PDT training data. The experiment was repeated three times with different distributions of the FicTree data in order to achieve a more reliable result (10% of FicTree is only some 16,000 tokens). In this way, 30% of FicTree was effectively used as test data, the parsers being trained each time on the PDT training data plus 90% of FicTree. It would have been better to use the whole FicTree data in a 10-fold cross-validation experiment (always adding 90% of the data to the PDT training set and testing on the remaining 10%), but we lacked the time and computational resources to do so. Table 5 compares the results of parsers trained on the PDT training data alone and on the merged data (train+ in the table), using the PDT etest data and the FicTree test data. For each measure (UAS, LAS, SENT), the accuracy of the parser trained on the PDT training data is shown in one column, followed by a column with the accuracy of the parser trained on the combined training data (PDT and FicTree, train+). The average over the three runs is shown.
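
In outline, the split-and-merge protocol can be reproduced as follows. The sketch is our reconstruction under assumptions of our own (sentence-level shuffling and three disjoint 10% test folds), not the original experimental code:

    import random

    def fictree_runs(pdt_train, fictree_sents, n_runs=3, test_frac=0.1, seed=0):
        # Yield (train, test) pairs: the PDT training data plus 90% of
        # FicTree for training, a disjoint 10% of FicTree for testing.
        sents = list(fictree_sents)
        random.Random(seed).shuffle(sents)
        fold = int(test_frac * len(sents))
        for run in range(n_runs):
            test = sents[run * fold:(run + 1) * fold]
            fic_train = sents[:run * fold] + sents[(run + 1) * fold:]
            yield list(pdt_train) + fic_train, test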

Table 5: Accuracy of parsers trained on PDT train data (train) and on PDT & FicTree train data (train+) (%)

               UAS     UAS      LAS     LAS      SENT    SENT
   Etest      train   train+   train   train+   train   train+
   MST        85.93    85.98   78.85    78.90   23.79    23.23
   Malt       86.32    86.41   80.74    80.87   31.32    31.62
   Parsito    86.30    86.48   80.78    81.02   31.17    31.53
   Turbo      88.27    88.34   81.79    81.89   27.74    27.93

               UAS     UAS      LAS     LAS      SENT    SENT
   FicTree    train   train+   train   train+   train   train+
   MST        85.03    85.49   77.24    77.68   26.78    27.18
   Malt       85.10    87.14   78.25    81.39   28.92    36.14
   Parsito    84.81    86.42   77.99    80.53   31.01    36.52
   Turbo      87.00    88.35   79.69    81.69   29.12    34.92

It is clear from the table that extending the training data with a part of the FicTree treebank is beneficial both for parsing the PDT test data and for parsing the FicTree data. The improvement on the PDT etest is not statistically significant (approximately 0.05% for UAS), but it is consistent for all parsers and measures except the SENT measure of the MSTParser.

For the FicTree test data, we note a significant improvement in parsing: the increase in the measures is between 0.4% and 2.5%. It is therefore clear that for the syntactic annotation of fiction texts, extending the training data with the FicTree training data is definitely beneficial.

5  The Agreement of Parsers versus the Manual Annotation

We also attempted to use the parsing results to assess the quality of the manual annotation and adjudication of the FicTree treebank. The whole FicTree data set was annotated by four parsers trained on the PDT training data. From these parsed data, we chose the cases where all four parsers agree on one dependency relation and/or syntactic function of a token, whereas the manual syntactic annotation differs. In total, the parsers agreed for 70.04% of the tokens in the FicTree data (78.12% if we only count dependencies, without syntactic labels); 5.17% of all tokens do not match the manual annotation (3.43% of tokens when syntactic labels are disregarded). Table 6 shows the 10 syntactic functions which occur most frequently in such cases of agreement between the four parsers and disagreement with the manual annotation. The first column shows the syntactic label from the manual annotation, the second column the proportion of such disagreements among the tokens with this syntactic label, and the third column the absolute number of occurrences.
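
Extracting such cases amounts to an intersection over the four parsed versions. The following sketch illustrates the idea under the same assumed (head, label) representation used above; it is not the script actually used:

    def suspicious_tokens(gold, parses):
        # gold: list of (head, label) pairs for one sentence; parses: the
        # same sentence as analyzed by the four parsers. Return the indices
        # of tokens on which all parsers agree with each other but differ
        # from the manual (gold) annotation.
        flagged = []
        for i, g in enumerate(gold):
            preds = {parse[i] for parse in parses}
            if len(preds) == 1 and g not in preds:
                flagged.append(i)
        return flagged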
                                                                      mon, the manual annotation is in most cases correct (the
      Table 6: Syntactic labels where parsers agree with each         parsers agree on an erroneous syntactic structure).
      other but disagree with manual annotation                       In some cases, it is unclear whether the manual annota-
                                                                      tion or the parsing results are correct, as in the following
                                                                      sentence:
                    Synt. label   Ratio     Number
                    Adv            5.49       1135                    Doktorka/+6/+1 vychutnávala chvíli efekt svých slov a pak
                    Obj            6.20       1065                    pokračovala:
                    AuxX           6.08        618                    The doctor enjoyed for a while the effect of her words, and
                    Sb             5.64        561                    then went on:
                    ExD           13.96        543
                    AuxC          11.65        536                    The head of the subject Doktorka ‘doctor’ in manual an-
                    AuxP           4.05        501                    notation is the coordinating conjunction a ‘and’ which co-
                    Atr            1.76        339                    ordinates two verbs representing two clauses: vychutná-
                    AuxV           8.08        302                    vala ‘enjoyed’ and pokračovala ‘went on/continued’. The
                    AuxY          15.85        271                    subject is considered as a sentence member modifying the
                                                                      whole coordination (i. e. both verbs). However, all parsers
                                                                      agree on a different head: the verb vychutnávala ‘enjoyed’
         The data in the table shows that differences between         closest to the subject. In this interpretation, the second
      parsers and manual markup often occur with the Adv and          verb has a null subject (pro-drop). Both interpretations
      Obj syntactic labels (adverbial and object), since the anno-    are possible in the formalism of PDT, there is no strict
      tation performed by parsers often differs from the manual       rule indicating when the subject should modify coordi-
      annotation due to the difficulty of linguistic phenomena.       nated verbs and when it should depend on the closest verb
      Frequent differences between parsing results and manual         only. In the PDT data, both solutions are used. (The more
      annotations are discussed in more detail later, we will first   the structures in the coordinated sentences are similar and
      give two examples of such differences and their supposed        simple, the more likely it is that the subject will be com-
      reason.                                                         mon.).

5.2  Most Frequent Discrepancies between Parsing Results and Manual Annotation

In the cases where the manually assigned dependency differs from the dependency on which the parsers agree, the syntactic labels are usually the same. The labels involved are mostly auxiliary functions, AuxV (auxiliary verbs), AuxP (prepositions) and AuxC (conjunctions), or functions related to punctuation (AuxX, AuxK, AuxG). When the syntactic labels differ, the most frequent mismatches are between Obj and Adv, Sb and Obj, and Adv and Atr.

The highest proportion of discrepancies between the manually and automatically assigned functions is found for the following functions: AuxO (46.5%), AuxR (21.9%), AuxY (15.9%), ExD (14.0%) and Atv (13.5%). AuxO and AuxR refer to the two possible syntactic functions of the reflexive particles se/si 'myself, yourself, herself...' depending on context; for their correct parsing, an understanding of semantics and the use of a lexicon would be necessary. The AuxY function covers particles and other auxiliary functions; ExD covers several different phenomena in the PDT formalism and is difficult to parse automatically. None of these functions occurs frequently in the training data.

5.3  Manual Analysis

When we manually analyzed a sample of sentences in which the four parsers agree on a dependency or syntactic label different from the one chosen manually, we found that in 75% of the cases the manual annotation was certainly correct, about 20% of the occurrences could not be decided quickly due to the complexity of the construction, and in less than 5% of the occurrences the manual annotation was incorrect. It would certainly be useful to check all cases of such discrepancies carefully, as it might reduce the error rate in the FicTree data by about 0.2–0.5%, but for now we lack the resources to do so.

6  Conclusion

The new manually annotated treebank of Czech fiction, FicTree, will allow for a better syntactic annotation of fiction texts when added to the PDT training data. Given that the larger training data were shown to be beneficial for parsing journalistic texts as well, its use may be broader. We plan to publish the FicTree treebank in the LINDAT/CLARIN repository in the near future (after additional checks of selected phenomena), and we would like to publish it later in the Universal Dependencies⁷ format, too, using publicly available conversion and verification tools.

⁷ See universaldependencies.org.

Acknowledgement

This paper, the creation of the data and the experiments on which the paper is based have been supported by the Ministry of Education of the Czech Republic through the project Czech National Corpus, no. LM2015044.

References

 [1] T. Bartoň, V. Cvrček, F. Čermák, T. Jelínek, V. Petkevič: "Statistiky češtiny" [Statistics of Czech]. NLN, Prague, 2009.
 [2] Y. Berzak, Y. Huang, A. Barbu, A. Korhonen, B. Katz: "Bias and Agreement in Syntactic Annotations", in Computing Research Repository, arXiv:1605.04481, 2016.
 [3] F. Čermák, D. Doležalová-Spoustová, J. Hlaváčová, M. Hnátková, T. Jelínek, J. Kocek, M. Kopřivová, M. Křen, R. Novotná, V. Petkevič, V. Schmiedtová, H. Skoumalová, M. Šulc, Z. Velíšek: "SYN2005: a balanced corpus of written Czech". Institute of the Czech National Corpus, Prague, 2005. Available online: http://www.korpus.cz.
 [4] J. Hajič: "Complex Corpus Annotation: The Prague Dependency Treebank", in M. Šimková (ed.): Insight into the Slovak and Czech Corpus Linguistics, pp. 54–73. Veda, Bratislava, Slovakia, 2006.
 [5] J. Hajič, J. Panevová, E. Buráňová, Z. Urešová, A. Bémová, J. Štěpánek, P. Pajas, J. Kárník: "A Manual for Analytic Layer Tagging of the Prague Dependency Treebank". ÚFAL Internal Report, Prague, 2001.
 [6] T. Jelínek, V. Petkevič: "Systém jazykového značkování současné psané češtiny" [A system of linguistic annotation of contemporary written Czech], in F. Čermák (ed.): Korpusová lingvistika Praha 2011, vol. 3: Gramatika a značkování korpusů, pp. 154–170. NLN, Prague, 2011.
 [7] M. Křen, T. Bartoň, V. Cvrček, M. Hnátková, T. Jelínek, J. Kocek, R. Novotná, V. Petkevič, P. Procházka, V. Schmiedtová, H. Skoumalová: "SYN2010: a balanced corpus of written Czech". Institute of the Czech National Corpus, Prague, 2010. Available online: http://www.korpus.cz.
 [8] A. F. T. Martins, M. B. Almeida, N. A. Smith: "Turning on the Turbo: Fast Third-Order Non-Projective Turbo Parsers", in Proceedings of ACL 2013, 2013.
 [9] R. McDonald, F. Pereira, K. Ribarov, J. Hajič: "Non-projective Dependency Parsing using Spanning Tree Algorithms", in Proceedings of EMNLP 2005, 2005.
[10] J. Nivre, J. Hall, J. Nilsson: "MaltParser: A Data-Driven Parser-Generator for Dependency Parsing", in Proceedings of LREC 2006, 2006.
[11] M. Straka, J. Hajič, J. Straková, J. Hajič jr.: "Parsing Universal Dependency Treebanks using Neural Networks and Search-Based Oracle", in Proceedings of TLT 2015, 2015.
[12] J. Straková, M. Straka, J. Hajič: "Open-Source Tools for Morphology, Lemmatization, POS Tagging and Named Entity Recognition", in Proceedings of ACL 2014, 2014.