=Paper=
{{Paper
|id=Vol-1885/181
|storemode=property
|title=FicTree: A Manually Annotated Treebank of Czech Fiction
|pdfUrl=https://ceur-ws.org/Vol-1885/181.pdf
|volume=Vol-1885
|authors=Tomáš Jelínek
|dblpUrl=https://dblp.org/rec/conf/itat/Jelinek17
}}
==FicTree: A Manually Annotated Treebank of Czech Fiction==
J. Hlaváčová (Ed.): ITAT 2017 Proceedings, pp. 181–185
CEUR Workshop Proceedings Vol. 1885, ISSN 1613-0073, © 2017 T. Jelínek
FicTree: a Manually Annotated Treebank of Czech Fiction
Tomáš Jelínek
Charles University, Faculty of Arts,
Prague, Czech Republic
Tomas.Jelinek@ff.cuni.cz
Abstract: We present a manually annotated treebank of Czech fiction, intended to serve as an addendum to the Prague Dependency Treebank. The treebank has only 166,000 tokens, so it is not a good basis for training NLP tools on its own, but added to the PDT training data, it can help improve the annotation of texts of fiction. We describe the composition of the corpus and the annotation process, including inter-annotator agreement. On the newly created data and the data of the PDT, we performed a number of experiments with parsers (TurboParser, Parsito, MSTParser and MaltParser). We observe that extending the PDT training data with a part of the new treebank does improve the results of parsing literary texts. We also investigate cases where parsers agree on a different annotation than the manual one.

1 Introduction

The Czech National Corpus (CNC) has decided to enrich the annotation of some of its large synchronous corpora with syntactic annotation, using the formalism of the Prague Dependency Treebank (PDT) [4]. The parsers used for syntactic annotation must be trained on manually annotated data, and only PDT data is available at present. To achieve reliable parsing, the training data must be as close as possible to the target texts, but the PDT contains only journalistic texts, while one third of the texts in the representative corpora of synchronous written Czech of the CNC belongs to the fiction genre. Fiction differs considerably from journalistic texts in many ways, for example in a significantly lower proportion of nouns versus verbs: in the journalistic genre, 33.8% of tokens are nouns and 16.0% are verbs; in fiction, the ratio of nouns and verbs is almost equal, with 24.3% of tokens being nouns and 21.2% verbs (based on statistics [1] from the SYN2005 corpus [3]).

Therefore, a new manually annotated treebank of fiction texts was created; it was annotated according to the PDT a-layer guidelines. The scope of the new treebank is only about 11% of the PDT data, due to the difficulties of manual syntactic annotation, but even so, using this new resource does improve the parsing of fiction texts.

In this article we present this new treebank, named FicTree (Treebank of Czech fiction), its composition, and the annotation process. We describe the first experiments with parsers based on the data of FicTree and PDT. In the data of the FicTree treebank parsed by four parsers, we investigate cases where all parsers agree on a syntactic annotation of one token which differs from the manual annotation.

2 Composition of the Treebank

The manually annotated treebank FicTree is composed of eight texts and longer fragments of texts from the genre of fiction published in Czech from 1991 to 2007, with a total of 166,437 tokens in 12,860 sentences. It is annotated according to the PDT a-layer annotation guidelines [5]. For comparison, the PDT data annotated on the analytical layer comprise 1,503,739 tokens in 87,913 sentences. Seven of the eight texts which compose the FicTree treebank were included in the CNC corpus SYN2010 [7] (the eighth one was originally intended to be included in the SYN2010 corpus too, but was removed in the balancing process). The size of the eight texts ranges from 4,000 to 32,000 tokens, with an average of 20,800 tokens. Most of the texts were originally written in Czech (80%); the remaining 20% are translations (from German and Slovak). Most of the texts belong to the fiction genre without any subgenre (according to the classification of the CNC); one large text (18.2% of all tokens) belongs to the subclass of memoirs, and 5.9% of tokens come from texts for children and youth.

The language data included in the PDT and in FicTree differ in many characteristics in a similar way to the differences between the whole genres of journalism and fiction described above. FicTree has significantly shorter sentences, with an average of 12.9 tokens per sentence compared to an average of 17.1 tokens per sentence in PDT. The part-of-speech ratio is also significantly different, as shown in Table 1.

Table 1: POS proportion in PDT and FicTree

                 PDT     FicTree
Nouns           35.60    22.31
Adjectives      13.72     7.73
Pronouns         7.68    16.42
Numerals         3.83     1.53
Verbs           14.34    23.16
Adverbs          6.18     9.19
Prepositions    11.39     9.14
Conjunctions     6.61     9.39
Particles        0.64     1.05
Interjections    0.01     0.07
Total          100      100

It is evident from the table that there is a significantly lower proportion of nouns, adjectives and numerals in FicTree, and a higher proportion of verbs, pronouns and adverbs, which corresponds to the assumption that fiction prefers verbal expressions, whereas journalism tends to use more nominal expressions.

3 Annotation Procedure

The FicTree treebank was syntactically annotated according to the formalism of the analytical layer of the Prague Dependency Treebank. The texts were lemmatized and morphologically annotated using a hybrid system of rule-based disambiguation [6] and the stochastic tagger Featurama.¹ The texts were then doubly parsed using two parsers, MSTParser [9] and MaltParser [10], trained on the PDT a-layer training data (the parsing took place several years ago, when better parsers such as TurboParser [8] were not available). The difference in the algorithms of the two parsers ensured that the errors in the texts were distributed differently, so it can be assumed that errors in the subsequent manual corrections will not be identical. According to Berzak [2], some deviations are likely common to both parsers and will also manifest in the final (manual) annotation, but this distortion of the data could not be avoided.

¹ See http://sourceforge.net/projects/featurama.

3.1 Manual Correction of Parsing Results

The automatically annotated data was then distributed to three annotators, who checked one sentence after another using the TrEd software for manual treebank editing and corrected the data. The two versions of the parsed text (parsed by the MSTParser and by the MaltParser) were always assigned to two different annotators, and we ensured that the combinations of parsers and annotators were varied. The data were divided into 163 text parts of approx. 1,000 tokens, and every combination of parsers and annotators occurred in at least 10 text parts (the proportions of texts corrected by the individual annotators were 26%, 35% and 39%). The task of the manual annotators was to correct syntactic structure and syntactic labels, but they could also suggest corrections of segmentation, tokenization, morphological annotation and lemmatization.

3.2 Adjudication

The two corrected versions of the syntactic annotation of each text were merged, and the resulting doubly annotated texts were examined by an experienced annotator (adjudicator), who decided which of the proposed annotations to accept. The adjudicator was not limited to the two manually corrected versions; she was allowed to choose another solution consistent with the PDT annotation manual and data. Some changes in tokenization and segmentation were also performed (159 cases, mainly sentence splits or merges). The adjudication took approximately five years of work, due to the difficulty of the task, the effort to maximize the consistency of the annotation of the same phenomenon across the treebank (and its accordance with PDT data), and other workload with a higher priority.

3.3 Accuracy of the Parsing and of the Manual Corrections

In the following two tables, we present the accuracy of each step of annotation and the inter-annotator agreement. Table 2 shows to what extent the automatically parsed and the manually corrected versions of the text agree with the final syntactic annotation, first for the texts annotated with the MSTParser, then for the ones annotated with the MaltParser. Two measures of agreement with the final annotation are shown: UAS (unlabeled attachment score, i.e. the proportion of tokens with a correct head) and LAS (labeled attachment score, i.e. the proportion of tokens with a correct head and dependency label).

Table 2: Accuracy of annotated versions

       UAS (auto.)  UAS (man.)  LAS (auto.)  LAS (man.)
MST    83.37        96.92       75.31        95.03
Malt   86.08        96.40       79.39        94.42

It is clear from the table that, due to the relatively low quality of the input parsing, the annotators had to carry out a large number of manual interventions in the parsing correction process: dependencies or labels were modified for 15–20% of tokens. The manually corrected versions differ much less from the final annotation; the disagreement is approx. 5% of the tokens.

Table 3 presents the agreement between the two automatically parsed versions and the inter-annotator agreement (the agreement between the two manually corrected versions). As in the previous table, we use the measures UAS and LAS.

Table 3: Agreement between parsers and inter-annotator agreement

            UAS    LAS
Parsers     83.48  75.66
Annotators  93.89  90.26

The table shows that the agreement between the automatically annotated versions is very similar to the agreement between the final annotation and the worse of the two parsing results. After the manual corrections, the agreement between the two versions of the texts increased considerably, but the remaining difference is approximately twice the difference between each of the manually corrected versions and the final syntactic annotation. This shows that the final annotation alternately used solutions from both versions of the texts.

4 Parsing Experiments

We conducted a series of experiments on PDT and FicTree data. All data was automatically lemmatized and morphologically tagged using the MorphoDiTa tagger [12].² We used four parsers: two parsers of the older generation, which had been used for the automatic annotation of the FicTree data (before manual corrections, with a different morphological annotation and with other settings providing better parsing accuracy), MSTParser [9]³ and MaltParser [10];⁴ and two newer parsers, TurboParser [8]⁵ and Parsito [11].⁶ We use three measures: UAS (unlabeled attachment score), LAS (labeled attachment score) and SENT (labeled attachment score for whole sentences, i.e. the proportion of sentences in which all tokens have correct heads and syntactic labels).

² Available on http://ufal.mff.cuni.cz/morphodita.
³ Available on https://sourceforge.net/projects/mstparser/. Used with the parameters decode-type:non-proj order:2.
⁴ Available on http://www.maltparser.org/. Used with the stacklazy algorithm, the libsvm learner and a set of optimized features obtained with MaltOptimizer.
⁵ Available on http://www.cs.cmu.edu/~ark/TurboParser/. Used with default options.
⁶ Available on https://ufal.mff.cuni.cz/parsito. Used with hidden_layer=400, sgd=0.01,0.001, transition_system=link2, transition_oracle=static.

4.1 Training on the PDT Data

The first experiment compared the parsing of the PDT test data (journalism) and the whole FicTree data (fiction) using parsers trained on the PDT training data (journalism). The results of the experiment are shown in Table 4; paired columns compare the results on the PDT etest data and on the whole FicTree data.

Table 4: Accuracy of parsers trained on PDT train data

          UAS             LAS             SENT
          etest  FicTree  etest  FicTree  etest  FicTree
MST       85.93  84.91    78.85  76.82    23.79  26.94
Malt      86.32  85.01    80.74  77.94    31.32  31.86
Parsito   86.30  84.62    80.78  77.65    31.17  31.32
Turbo     88.27  86.66    81.79  79.06    27.74  29.61

The UAS and LAS scores for all parsers are approximately 2% worse for FicTree than for PDT, probably due to the genre differences of FicTree versus the PDT data. In the case of SENT, the FicTree scores are comparable to or better than the PDT etest scores, probably because the sentence length in FicTree is significantly lower, so there is a higher percentage of well-parsed sentences.

4.2 Training on PDT Data Combined with FicTree

In the second experiment, we split the FicTree data into training data (90%) and test data (10%) and combined the FicTree training data with the PDT training data. This experiment was repeated three times with different distributions of the FicTree data, in order to achieve a more reliable result (10% of FicTree is only 16,000 tokens). In that way, 30% of FicTree has effectively been used as test data, the parsers being trained each time on the PDT training data plus 90% of FicTree. It would have been better to use the whole FicTree data in a 10-fold cross-validation experiment (always adding 90% of the data to the PDT training data and testing on the remaining 10%), but we lacked the time and computational resources to do so. Table 5 compares the results of parsers trained on the PDT training data alone (train) and on the merged data (train+), using the PDT etest data and the FicTree test data. For each of the measures (UAS, LAS, SENT), the accuracy of the parser trained on the PDT training data is in one table column, followed by a column with the accuracy of the parser trained on the combined training data (PDT and FicTree, train+). The average over the three experiments is shown.

Table 5: Accuracy of parsers trained on PDT train data (train) and PDT&FicTree train data (train+)

          UAS            LAS            SENT
Etest     train  train+  train  train+  train  train+
MST       85.93  85.98   78.85  78.90   23.79  23.23
Malt      86.32  86.41   80.74  80.87   31.32  31.62
Parsito   86.30  86.48   80.78  81.02   31.17  31.53
Turbo     88.27  88.34   81.79  81.89   27.74  27.93

          UAS            LAS            SENT
FicTree   train  train+  train  train+  train  train+
MST       85.03  85.49   77.24  77.68   26.78  27.18
Malt      85.10  87.14   78.25  81.39   28.92  36.14
Parsito   84.81  86.42   77.99  80.53   31.01  36.52
Turbo     87.00  88.35   79.69  81.69   29.12  34.92

It is clear from the table that extending the training data with a part of the FicTree treebank is beneficial both for parsing the PDT test data and for parsing FicTree data. The improvement in the parsing of the PDT etest is not statistically significant (approximately 0.05% for UAS), but it is consistent for all parsers and measures except the SENT measure for the MSTParser.

For the FicTree test data, we note a significant improvement in parsing; the increase in the measures is between 0.4% and 2.5%. It is therefore clear that for the syntactic annotation of texts of fiction, extending the training data with the FicTree training data is definitely beneficial.

5 The Agreement of Parsers versus the Manual Annotation

We also attempted to use the results of the parsing to assess the quality of the manual annotation and adjudication of the FicTree treebank. The whole FicTree data was annotated by four parsers trained on the PDT training data. From these parsed data, we chose those cases where all four parsers agree on one dependency relation and/or syntactic function of a token, whereas the manual syntactic annotation is different. In total, the parsers agreed for 70.04% of tokens in the FicTree data (78.12% if we only count dependencies without syntactic labels). 5.17% of all tokens do not match the manual annotation (3.43% of tokens if syntactic labels are disregarded). Table 6 shows the 10 syntactic functions which occur most frequently in such cases of agreement between the four parsers and disagreement with the manual annotation. The first column shows the syntactic label from the manual annotation, the second column the proportion of disagreement among the tokens with this syntactic label, and the third column the absolute number of occurrences.

Table 6: Syntactic labels where parsers agree with each other but disagree with manual annotation

Synt. label  Ratio  Number
Adv           5.49    1135
Obj           6.20    1065
AuxX          6.08     618
Sb            5.64     561
ExD          13.96     543
AuxC         11.65     536
AuxP          4.05     501
Atr           1.76     339
AuxV          8.08     302
AuxY         15.85     271

The data in the table shows that differences between the parsers and the manual annotation often occur with the Adv and Obj syntactic labels (adverbial and object), since the annotation performed by parsers often differs from the manual annotation due to the difficulty of the underlying linguistic phenomena. Frequent differences between parsing results and manual annotations are discussed in more detail later; we will first give two examples of such differences and their supposed reason.

5.1 Examples of Differences between Manual Annotation and Parsing Results

The first example, the sentence fragment pohledy plné bezměrné důvěry 'regards full of unbounded trust' displayed below, shows a typical example of a wrong parsing result caused by incorrect morphological annotation. The parsers agree on an erroneous interpretation of the syntactic structure. After the tokens where dependencies or syntactic labels differ, we show the annotation (numbers indicate relative differences: –1 means that the governing node is positioned 1 to the left, +2 that the governing node is 2 to the right; syntactic labels are shown if they differ).

Pohledy plné/–1/+2 bezměrné důvěry/Obj/–2/Atr/–3
Regards full of unbounded trust

Incorrect morphological tagging of the ambiguous form plné 'full' (which can formally agree both with the preceding noun pohledy 'regards' and with the following noun důvěry 'trust' in number, gender and case) led the parsers to ignore the valency characteristics of the adjective plný 'full': they consider it to be an attribute of the following noun důvěry 'trust', which they in turn interpret as a nominal attribute of the preceding noun pohledy 'regards'. The manual annotation is correct: the adjective plný 'full' depends on the preceding noun pohledy 'regards', and the following noun důvěry 'trust' is an object of the adjective. Similar differences in the attribution of the Adv and Obj syntactic labels and their dependency relations are common, and the manual annotation is in most cases correct (the parsers agree on an erroneous syntactic structure).

In some cases, it is unclear whether the manual annotation or the parsing results are correct, as in the following sentence:

Doktorka/+6/+1 vychutnávala chvíli efekt svých slov a pak pokračovala:
The doctor enjoyed for a while the effect of her words, and then went on:

The head of the subject Doktorka 'doctor' in the manual annotation is the coordinating conjunction a 'and', which coordinates two verbs representing two clauses: vychutnávala 'enjoyed' and pokračovala 'went on/continued'. The subject is thus considered a sentence member modifying the whole coordination (i.e. both verbs). However, all parsers agree on a different head: the verb vychutnávala 'enjoyed' closest to the subject. In this interpretation, the second verb has a null subject (pro-drop). Both interpretations are possible in the formalism of PDT; there is no strict rule indicating when the subject should modify coordinated verbs and when it should depend on the closest verb only. In the PDT data, both solutions are used. (The more similar and simple the structures of the coordinated clauses are, the more likely it is that the subject will be common.)
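The measures and the parser-agreement filter used in Sections 3.3–5 can be sketched as follows. This is a minimal illustration only: the token representation (pairs of head index and syntactic label) and all variable names are assumptions for the sketch, not the toolchain actually used for FicTree.

```python
# Sketch of UAS, LAS, SENT and the "all parsers agree, manual differs" filter.
# Each token is represented as (head_index, label); this layout is illustrative.

def uas(gold, pred):
    """Unlabeled attachment score: proportion of tokens with a correct head."""
    assert len(gold) == len(pred)
    return sum(1 for g, p in zip(gold, pred) if g[0] == p[0]) / len(gold)

def las(gold, pred):
    """Labeled attachment score: correct head AND dependency label."""
    assert len(gold) == len(pred)
    return sum(1 for g, p in zip(gold, pred) if g == p) / len(gold)

def sent(gold_sents, pred_sents):
    """SENT: proportion of sentences with all heads and labels correct."""
    assert len(gold_sents) == len(pred_sents)
    return sum(1 for g, p in zip(gold_sents, pred_sents) if g == p) / len(gold_sents)

def unanimous_disagreements(manual, parses):
    """Indices of tokens where all parsers agree with each other
    but differ from the manual annotation (cf. Table 6)."""
    suspicious = []
    for i, m in enumerate(manual):
        outputs = [p[i] for p in parses]
        if all(o == outputs[0] for o in outputs) and outputs[0] != m:
            suspicious.append(i)
    return suspicious

# Toy sentence: head index 0 stands for the artificial root.
manual = [(2, "Sb"), (0, "Pred"), (2, "Adv")]
parser_outputs = [
    [(2, "Sb"), (0, "Pred"), (2, "Obj")],  # e.g. MSTParser
    [(2, "Sb"), (0, "Pred"), (2, "Obj")],  # e.g. MaltParser
    [(2, "Sb"), (0, "Pred"), (2, "Obj")],  # e.g. TurboParser
    [(2, "Sb"), (0, "Pred"), (2, "Obj")],  # e.g. Parsito
]
print(unanimous_disagreements(manual, parser_outputs))  # → [2]
```

Here the last token has the correct head in every parse (UAS 1.0) but a wrong label (LAS 2/3), and since all four parsers agree on that wrong label, the token would be flagged for the kind of manual re-check described in Section 5.3.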
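The evaluation setup of Section 4.2 (three runs, each holding out a different 10% of FicTree as test data and adding the remaining 90% to the PDT training data) can be sketched as below. Contiguous slicing into folds is an assumption of this sketch; the paper only states that the three distributions of the FicTree data differed.

```python
# Sketch of the three-run split of Section 4.2: disjoint 10% test folds,
# each remaining 90% concatenated with the PDT training data.

def three_fold_splits(fictree_sentences):
    """Yield (train_part, test_part) pairs for three disjoint 10% test folds."""
    fold = len(fictree_sentences) // 10
    for run in range(3):
        lo, hi = run * fold, (run + 1) * fold
        test = fictree_sentences[lo:hi]
        train = fictree_sentences[:lo] + fictree_sentences[hi:]
        yield train, test

# Stand-ins for parsed sentences; a real run would train each parser on
# pdt_train + train_part, evaluate on test_part, and average the three scores.
pdt_train = ["<PDT sentence>"] * 500
for train_part, test_part in three_fold_splits(list(range(100))):
    combined_train = pdt_train + train_part
    assert len(train_part) == 90 and len(test_part) == 10
```

A full 10-fold cross-validation, which the paper names as the preferable but unaffordable design, would simply extend `range(3)` to `range(10)`.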
5.2 Most Frequent Discrepancies between Parsing Results and Manual Annotation

In cases where the manually assigned dependency differs from the one on which the parsers agree, the syntactic labels are usually the same. The functions involved are mostly auxiliary functions, namely AuxV (auxiliary verbs), AuxP (prepositions) and AuxC (conjunctions), or functions related to punctuation (AuxX, AuxK, AuxG). When the syntactic labels differ, the most frequent mismatches are Obj and Adv, Sb and Obj, and Adv and Atr.

The highest proportion of discrepancies between the manually and automatically assigned functions is related to the following functions: AuxO (46.5%), AuxR (21.9%), AuxY (15.9%), ExD (14.0%) and Atv (13.5%). AuxO and AuxR refer to two possible syntactic functions of the reflexive particles se/si 'myself, yourself, herself…' depending on context; for their correct parsing, an understanding of semantics and the use of a lexicon would be necessary. The AuxY function covers particles and other auxiliary functions; ExD is a function which covers several different phenomena in the PDT formalism and is difficult to parse automatically. None of these functions occur frequently in the training data.

5.3 Manual Analysis

When we manually analyzed a sample of sentences in which the four parsers agree on a dependency or syntactic label different from the one chosen manually, we found that in 75% of cases the manual annotation was certainly correct, about 20% of the occurrences could not be decided quickly due to the complexity of the construction, and in less than 5% of such occurrences the manual annotation was incorrect. It would certainly be useful to carefully check all cases of such discrepancy, as it might reduce the error rate in the FicTree data by about 0.2–0.5%, but for now we lack the resources to do so.

6 Conclusion

The new manually annotated treebank of Czech fiction, FicTree, will allow for a better syntactic annotation of texts of fiction when added to the PDT training data. Given that the larger training data were shown to be beneficial in parsing journalistic texts as well, its use may be broader. We plan to publish the FicTree treebank in the Lindat/CLARIN repository in the near future (after additional checks of selected phenomena), and we would like to publish it later in the Universal Dependencies⁷ format, too, using publicly available conversion and verification tools.

⁷ See universaldependencies.org.

Acknowledgement

This paper, the creation of the data and the experiments on which the paper is based have been supported by the Ministry of Education of the Czech Republic through the project Czech National Corpus, no. LM2015044.

References

[1] T. Bartoň, V. Cvrček, F. Čermák, T. Jelínek, V. Petkevič: "Statistiky češtiny [Statistics of Czech]". NLN, Prague, 2009.
[2] Y. Berzak, Y. Huang, A. Barbu, A. Korhonen, B. Katz: "Bias and Agreement in Syntactic Annotations", in Computing Research Repository, 1605.04481, 2016.
[3] F. Čermák, D. Doležalová-Spoustová, J. Hlaváčová, M. Hnátková, T. Jelínek, J. Kocek, M. Kopřivová, M. Křen, R. Novotná, V. Petkevič, V. Schmiedtová, H. Skoumalová, M. Šulc, Z. Velíšek: "SYN2005: a balanced corpus of written Czech". Institute of the Czech National Corpus, Prague, 2005. Available on-line: http://www.korpus.cz.
[4] J. Hajič: "Complex Corpus Annotation: The Prague Dependency Treebank," in Šimková M. (ed.): Insight into the Slovak and Czech Corpus Linguistics, pp. 54–73. Veda, Bratislava, Slovakia, 2006.
[5] J. Hajič, J. Panevová, E. Buráňová, Z. Urešová, A. Bémová, J. Štěpánek, P. Pajas, J. Kárník: "A Manual for Analytic Layer Tagging of the Prague Dependency Treebank." ÚFAL Internal Report, Prague, 2001.
[6] T. Jelínek, V. Petkevič: "Systém jazykového značkování současné psané češtiny [A system of linguistic annotation of contemporary written Czech]," in Čermák F. (ed.): Korpusová lingvistika Praha 2011, vol. 3: Gramatika a značkování korpusů, pp. 154–170. NLN, Prague, 2011.
[7] M. Křen, T. Bartoň, V. Cvrček, M. Hnátková, T. Jelínek, J. Kocek, R. Novotná, V. Petkevič, P. Procházka, V. Schmiedtová, H. Skoumalová: "SYN2010: a balanced corpus of written Czech". Institute of the Czech National Corpus, Prague, 2010. Available on-line: http://www.korpus.cz.
[8] A. F. T. Martins, M. B. Almeida, N. A. Smith: "Turning on the Turbo: Fast Third-Order Non-Projective Turbo Parsers," in Proceedings of ACL 2013, 2013.
[9] R. McDonald, F. Pereira, K. Ribarov, J. Hajič: "Non-projective Dependency Parsing using Spanning Tree Algorithms," in Proceedings of EMNLP 2005, 2005.
[10] J. Nivre, J. Hall, J. Nilsson: "MaltParser: A Data-Driven Parser-Generator for Dependency Parsing," in Proceedings of LREC 2006, 2006.
[11] M. Straka, J. Hajič, J. Straková, J. Hajič jr.: "Parsing Universal Dependency Treebanks using Neural Networks and Search-Based Oracle," in Proceedings of TLT 2015, 2015.
[12] J. Straková, M. Straka, J. Hajič: "Open-Source Tools for Morphology, Lemmatization, POS Tagging and Named Entity Recognition," in Proceedings of ACL 2014, 2014.