=Paper=
{{Paper
|id=Vol-3033/paper78
|storemode=property
|title=Atypical or Underrepresented? A Pilot Study on Small Treebanks
|pdfUrl=https://ceur-ws.org/Vol-3033/paper78.pdf
|volume=Vol-3033
|authors=Akshay Aggarwal,Chiara Alzetta
|dblpUrl=https://dblp.org/rec/conf/clic-it/AggarwalA21
}}
==Atypical or Underrepresented? A Pilot Study on Small Treebanks==
Akshay Aggarwal¹ and Chiara Alzetta²
1. Twilio, Prague, Czechia
2. Istituto di Linguistica Computazionale “A. Zampolli”, CNR, Pisa - ItaliaNLP Lab
aaggarwal@twilio.com, chiara.alzetta@ilc.cnr.it
Abstract

We illustrate an approach for multilingual treebank exploration by introducing a novel adaptation to small treebanks of a methodology for identifying cross-lingual quantitative trends in the distribution of dependency relations. By relying on the principles of cross-validation, we reduce the amount of data required to execute the method, paving the way to expanding its use to low-resource languages. We validated the approach on 8 small treebanks, each containing less than 100,000 tokens and representing typologically different languages. We also show preliminary but promising evidence on the use of the proposed methodology for treebank expansion.

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 Introduction and Motivation

Linguistically-annotated language resources like treebanks are fundamental for developing reliable models to train and test tools used to address Natural Language Processing (NLP) tasks by acquiring linguistic evidence from corpora. Concerning the latter, researchers frequently rely on multilingual or parallel resources in contrastive studies to quantify the similarities and differences between languages (Jiang and Liu, 2018). Over the past few years, the Universal Dependencies (UD) initiative (https://universaldependencies.org) (Zeman et al., 2021) has further encouraged such studies. UD defines a universal inventory of categories and guidelines to facilitate consistent annotation of similar constructions across languages (Nivre, 2015; de Marneffe et al., 2021), and, at present, the project includes about 200 treebanks representing over 100 languages. The consistent annotation of linguistic phenomena under a shared representation and across different languages makes UD treebanks exceptionally well suited for quantitative comparison of languages (see, for example, Croft et al. (2017), Berdicevskis et al. (2018), Vylomova et al. (2020) and, among our works, Alzetta et al. (2019a) and Alzetta et al. (2020a)).

Despite their great relevance for linguistic investigations, large treebanks are available for only a tiny fraction of the world's languages (Vania et al., 2019). Even within the UD project, around 60% of the treebanks can be considered small, i.e. containing less than 100,000 tokens. Treebank size, in fact, is generally identified as the bottleneck for obtaining high-quality representative models of language use to be employed in downstream NLP applications. In general terms, larger datasets allow for better generalisations of language constructions, leading to better performance of systems trained on such data (Zeman et al., 2018). In fact, ad-hoc strategies are generally needed when dealing with low-resourced languages (Hedderich et al., 2021).

This paper illustrates a novel workflow specifically designed to adapt an existing methodology for treebank exploration to small treebanks. The base method, extensively described by Alzetta et al. (2020b), relies on an unsupervised algorithm called LISCA (LInguistically-driven Selection of Correct Arcs) (Dell'Orletta et al., 2013). LISCA has been successfully employed in past works for performing quantitative cross-lingual analyses (Alzetta et al., 2019a; Alzetta et al., 2019b; Alzetta et al., 2020a) and error detection on UD treebanks (Alzetta et al., 2017). The algorithm works in two main steps. First, it acquires evidence about language use from the distributions of phenomena in annotated sentences. The algorithm then uses such evidence to distinguish typical from atypical constructions in an unseen set of sentences. The
typicality of a construction is determined with respect to the examples observed in a corpus used as a reference, and is encoded with a score. This score reflects the probability of observing a dependency in a given context (both sentence-level and corpus-level) on the basis of the constructions sharing common properties reported in the reference corpus. Hence, from our point of view, typicality and frequency are tightly related concepts, as non-standard constructions are also usually less frequent in natural language use.

As such, the LISCA methodology relies on large sets of automatically parsed sentences to collect statistics about the distributions of phenomena: even if the data contains parsing errors (an assumption when producing automatically parsed data is that most of the errors made by a parser are consistent; as we showed in Alzetta et al. (2017), the LISCA-based method allows spotting these error types in annotations), the corpus size guarantees that the collected statistics reflect actual language use. However, such an approach can be employed only for analysing languages for which large amounts of data are available, or at least for which the parser outputs are generally considered reliable. To overcome this limit, Aggarwal (2020) suggested that if the statistics are acquired from gold annotations (such as treebanks), the algorithm could collect them from less data, since these resources are assumed to be error-free.

We implemented this proposal by adapting the original LISCA workflow as detailed in Section 2. Our variation to the original methodology is inspired by the k-fold approach commonly used for performing systems' cross-validation: according to this approach, a dataset is split into sub-sets of equal size, iteratively used for training and/or evaluating a system. We employ a similar strategy for evaluating the typicality of the dependency relations in each treebank split, acquiring the statistics from the sentences contained in the other splits rather than from an external reference corpus. This small but substantial change in the method workflow allows us to apply the LISCA algorithm to small treebanks, which is particularly relevant in the case of analyses performed on low-resource languages.

We tested the methodology in a case study, reported in Section 3, involving 8 languages represented using UD treebanks. Our goal is to test whether our method can support linguistic investigations for exploring and quantifying similarities and differences between typologically different languages. To this aim, we first validate the adaptation to the original LISCA approach proposed here in Section 3.1. Then, we exemplify how the obtained results can be employed for linguistic investigations in Section 3.2. To improve the cross-linguistic comparability of the analysis, we relied on Parallel UD (PUD) treebanks: a collection of parallel treebanks developed for the CoNLL 2017 Shared Task on multilingual parsing (Zeman et al., 2017) and linguistically annotated under the UD representation. Being parallel, PUD treebanks are particularly well suited for carrying out multilingual studies: they contain only 1,000 sentences, manually translated from English into the other languages, representing a perfect testbed for our approach.

Before concluding the paper in Section 5, we report the results of preliminary investigations exploring whether our approach could also be employed for automatically identifying underrepresented phenomena in treebanks. Søgaard (2020) and Anderson et al. (2021) argue that some treebanks cover only a restricted sample of the structures commonly used in a language, leaving out less common phenomena. This leakiness might affect the performance of NLP systems even more than the system architecture does. Thus, treebanks should be expanded not only to improve their representativeness but also to obtain more truthful performance estimates for systems trained on them. Section 4 investigates whether our methodology can contribute to this issue by exploring its application in automatic treebank expansion.

The contributions of the paper can be listed as: (i) a novel approach specifically designed for carrying out multilingual investigations on small treebanks; (ii) a case study involving eight typologically different languages to test the methodology; and (iii) a novel formula, introduced in Section 3.2, to measure the distance between dependents and their syntactic head, which improves the cross-lingual comparability of treebanks with respect to this property.

2 Approach

The method presented in this paper relies on a methodology for treebank exploration based on the unsupervised algorithm LISCA (Dell'Orletta et al., 2013), which we adapted to expand its usage to small treebanks, namely those containing less than
100,000 tokens.

As mentioned earlier, LISCA can be employed to quantify the typicality of each dependency relation (hereafter deprel; given a dependency A --nsubj--> B, we refer to A → B as the dependency, with nsubj as the dependency label) of a linguistically annotated corpus with respect to a large set of examples taken as reference (Alzetta et al., 2020b). To achieve this goal, the algorithm first collects statistics about linguistically motivated properties of deprels extracted from a corpus of automatically parsed sentences (called the reference corpus) to create a statistical model (SM). Then, the algorithm calculates a typicality score for each deprel appearing in a test corpus relying on the SM, while also considering its linguistic context to assess the relevance of the dependency label used for marking the dependency in the given context. When interpreting the assigned LISCA score, a deprel marked by LISCA as highly typical was possibly frequently observed in similar contexts also in the reference corpus. In contrast, an atypical deprel could be characterised by certain properties which make it somehow distant from the other instances of dependencies marked with the same label in the reference corpus.

In essence, LISCA computes the score for a given deprel taking into account local properties (e.g., dependency length and direction) of each deprel in the test corpus as well as the linguistic context where it is located (e.g., distance from root and leaves, and number of siblings), comparing them both against the properties and contexts of all dependencies annotated with the same dependency label in the reference corpus. For this reason, the reference corpus has generally corresponded to a large corpus of around 40M tokens: the corpus size allows accounting for a more comprehensive set of examples of linguistic constructions while also compensating for possible parser errors.

Workflow. For this study, we implemented the adaptation of the LISCA workflow proposed by Aggarwal (2020). Inspired by the k-fold validation approach, we modified the original approach as follows:
1) Split a treebank into k portions of equal size (k = 4 for this work), each containing the same number of sentences;
2) Use LISCA to acquire the statistics (encoded in the SM) about the distribution of linguistic phenomena from a reference corpus obtained by merging k − 1 portions of the previously split treebank;
3) Use the obtained SM to compute the typicality score of the deprels appearing in the remaining treebank portion (i.e., the one not included in the reference corpus);
4) Repeat steps 2 and 3 until all k portions are analysed;
5) Merge the analysed portions and order the deprels by decreasing LISCA score to obtain a unique ranking of all the deprels in the treebank.

The ordered ranking of deprels can be explored to investigate which linguistic constructions, represented by means of the deprels, were marked as typical or atypical, characterised by higher and lower scores, respectively.

2.1 Data and Languages

We tested our method on a selection of Parallel UD (PUD) treebanks (Zeman et al., 2017), each containing 1,000 sentences. In order to encompass different language families and genera (the language family and genus, reported as (ISO language code, family, genus), are taken from the World Atlas of Language Structures, available online at https://wals.info/languoid; Dryer and Haspelmath, 2013), we carried out the case study on the following eight languages: Arabic (AR; Afro-Asiatic, Semitic), Czech (CZ; Indo-European, Slavic), English (EN; Indo-European, Germanic), Hindi (HI; Indo-European, Indic), Finnish (FI; Uralic, Finnic), Indonesian (ID; Austronesian, Malayo-Sumbawan), Italian (IT; Indo-European, Romance) and Thai (TH; Tai-Kadai, Kam-Tai).

3 Results

3.1 Validating the Approach

We report the results of an analysis to verify whether the adapted and original LISCA-based methods return comparable results. To this aim, we compared the LISCA rankings of PUD deprels obtained using the original algorithm workflow, which employs a large reference corpus to build the language SM, and the novel workflow defined above, which acquires the statistics from the treebank itself. We carried out this analysis for the Italian and English PUD treebanks. We manually verified in previous studies that the original approach applied to those languages allows capturing elements of linguistic and parsing complexity, distinguishing between typical and atypical constructions along the produced ranking of deprels (Alzetta et al., 2019a; Alzetta et al., 2020b).
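The five-step adapted workflow under validation here (Section 2) can be sketched as follows. This is a minimal illustration of the cross-validation skeleton only: `build_model` and `score_deprels` are hypothetical placeholders standing in for LISCA's statistics collection and scoring steps, which are not reproduced here.

```python
from typing import Callable, List, Tuple

def kfold_lisca_ranking(
    sentences: List[object],
    build_model: Callable[[List[object]], object],
    score_deprels: Callable[[object, List[object]], List[Tuple[str, float]]],
    k: int = 4,  # k = 4 in this work
) -> List[Tuple[str, float]]:
    """Score each fold against a model built from the other k-1 folds,
    then merge all scored deprels into a single decreasing ranking."""
    fold_size = len(sentences) // k
    folds = [sentences[i * fold_size:(i + 1) * fold_size] for i in range(k)]  # step 1
    scored: List[Tuple[str, float]] = []
    for i, held_out in enumerate(folds):
        reference = [s for j, fold in enumerate(folds) if j != i for s in fold]
        model = build_model(reference)                 # step 2: SM from k-1 portions
        scored.extend(score_deprels(model, held_out))  # step 3: score held-out portion
    # steps 4-5: once all folds are analysed, merge and order by decreasing score
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

Each sentence is thus scored exactly once, against statistics it did not contribute to, which is the adaptation's substitute for an external reference corpus.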
We compared the deprel rankings obtained using the two workflows in terms of Spearman correlation, which returns a rank correlation coefficient indicating the statistical dependence between the rankings of two observed variables. The analysis showed a strong and significant correlation between the rankings produced by the two workflows in both languages. Specifically, we obtained a Spearman correlation coefficient of 0.95 (p < 0.05) for both Italian and English.

Such high correlations confirm that gold corpora, although small, can be used to acquire relevant statistics about language use. Manually revised data might be limited in size; however, their annotations are generally correct also in the case of rare phenomena, which a parser could wrongly annotate due to their low frequency in the data. While large reference corpora compensate for the possibly wrong parses assigned to rare constructions with their size, small reference corpora compensate with consistency and correctness. Hence, we could say that using gold data for building the SM allows reducing the number of examples needed for acquiring language statistics. We notice a difference between the two rankings only when focusing on the bottom part, where we find the deprels with the lowest scores. While the original method produces only a tiny number of deprels with a LISCA score equal to 0, which we usually excluded from the analyses, we observe many more of them in the ranking produced with our workflow adaptation. A LISCA score of zero is assigned to those dependencies never observed in the reference corpus; thus, their typicality is extremely low. It is not surprising that smaller reference corpora produce a higher number of these cases, given their limited coverage. However, the high correlation coefficient reported above suggests that such deprels are still interesting from a linguistic perspective. They correspond to rare constructions in the language, which obtain a score slightly higher than zero in the case of a larger reference corpus but are still placed in the lower positions of the ranking.

Figure 1: LinkLengthAdjusted formula for normalising deprel length in multilingual comparisons. Note: ⌊·⌋ denotes the floor function, while [a, b] denotes the closed interval over a and b.

3.2 Rankings Exploration

This subsection exemplifies how the ranking of deprels obtained with our adapted approach can be employed in linguistic analyses to identify similarities and differences between languages. For this case study, we focused on a specific property of deprels, namely the length of the dependency link. The length of a deprel, measured as the linear distance in terms of intervening tokens between a word and its syntactic head, is a property frequently explored in linguistically annotated corpora. It is highly related to processing complexity in all languages (Demberg and Keller, 2008; Temperley, 2007; Futrell et al., 2015; Yu et al., 2019). For example, McDonald and Nivre (2011) observed that parsers tend to make more mistakes on longer sentences and longer dependencies. Such complexity makes this property particularly interesting from a multilingual perspective, especially when dealing with parallel corpora, as in our case study.

We inspected the ranking of deprels to monitor the LISCA score associated with deprels of different lengths and their distribution along the ranking of each language. To facilitate the exploration and comparison of the rankings, we split each ranking into three portions of equal size, referred to as top, middle and bottom, where the top contains the deprels obtaining the highest scores (more typical), while the bottom contains the deprels with the lowest scores (atypical).

In order to allow a proper multilingual comparison of the distribution of deprel lengths along the rankings, we defined a novel measure called Adjusted Link Length (LLadjusted, cf. Figure 1). The measure, inspired by the Brevity Penalty used in the BLEU score (Papineni et al., 2002), is designed to compute the length of deprels involving content words as dependants while simultaneously improving cross-language comparability, as the length of
a deprel is measured keeping in mind the overall length of the sentence where it is located and the average sentence length in the treebank. This way, instead of comparing absolute length values, we can observe the tendency of languages towards producing longer or shorter deprels.

Figure 2: Distribution of Adjusted Link Length on content words across LISCA Rankings.

In LLadjusted, we operationally compute the length of deprels as a function of a) the average sentence length in the treebank (TrbAvgSentLen), b) the length of the sentence where the deprel appears (SentLength), and c) the distance, in tokens, between the dependent and its syntactic head (LLraw). The formula's values of 0.5 and 1.25 were determined empirically to account for unusually short and long sentences, respectively, in the treebank. Thus, the resulting value associated with each deprel denotes it as 'long', 'medium' or 'short' with respect to the average deprel length computed in the treebank. Note that, although our analysis focuses on content words, function words are still accounted for when computing the LISCA score, as they might be part of the context of content words.

Figure 2 displays the distribution of deprels of different lengths (computed using LLadjusted) along the portions of the treebank ranking of each language. The distributions show that longer deprels are given a lower plausibility score by LISCA in all languages. Interestingly, the length distributions are quite similar across the different languages, except for Hindi. Such a difference could be due to the typical word order of constituents in the considered languages. Hindi, in fact, is the only language of our set where the order of the main constituents is of the type S(ubject)O(bject)V(erb) (all the other languages are S(ubject)V(erb)O(bject) languages), and the dominant word order of a language has been shown to influence the dependency length across major dependency types by Yadav et al. (2020).

It should be noted that such a difference between languages can also be observed by computing the length of dependency relations straightforwardly on the PUD treebanks: the average linear link length computed on Hindi PUD is 6.54; for Thai PUD, the language showing the shortest relations, it is 2.67, while the remaining languages show values ranging between 3.1 and 3.5. However, our methodology allows us to combine multiple properties simultaneously into a score, thus isolating in different portions of the rankings the deprels that show an atypical value for a given property but could still be considered quite typical for the language based on their context. As proof, observe that long and medium deprels in Hindi tend to appear earlier in the ranking than in other languages: 19.73% of the deprels located in the middle bin are medium or long, suggesting that longer deprels are more common in Hindi. On the contrary, only 7% of the deprels in the middle bin are long in Thai, pointing to their atypicality in that language.

The above results show the methodology's effectiveness for exploring tendencies and peculiarities of languages in multilingual studies. However, small samples like the PUD treebanks are usually not suited for analysing infrequent phenomena (Taherdoost, 2016). Hence, one might wonder whether we are actually capturing the atypicality of linguistic constructions or whether, instead, we are biased by phenomena underrepresented in the treebank. In the following Section, we explore whether low LISCA scores might be associated with infrequent linguistic phenomena due to under-representation in the data used to build the SM.
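The bin-level percentages discussed above can be computed mechanically once every deprel in a score-sorted ranking carries a length category. The sketch below is illustrative only; the data layout and category names are our assumptions, not the paper's code.

```python
from collections import Counter
from typing import Dict, List, Tuple

def length_distribution_by_bin(
    ranked_deprels: List[Tuple[float, str]],
) -> Dict[str, Dict[str, float]]:
    """Split a ranking, sorted by decreasing LISCA score, into three
    equal bins (top, middle, bottom) and report the share of each
    length category ('short', 'medium', 'long') per bin."""
    n = len(ranked_deprels)
    third = n // 3
    bins = {
        "top": ranked_deprels[:third],
        "middle": ranked_deprels[third:2 * third],
        "bottom": ranked_deprels[2 * third:],
    }
    shares: Dict[str, Dict[str, float]] = {}
    for name, items in bins.items():
        counts = Counter(category for _, category in items)
        total = len(items) or 1  # guard against an empty bin
        shares[name] = {c: counts[c] / total for c in ("short", "medium", "long")}
    return shares
```

Figures such as "19.73% of the deprels in the Hindi middle bin are medium or long" then correspond to summing the relevant shares within one bin.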
4 Towards Treebank Expansion

Our analyses started from the premise that PUD treebanks are error-free; therefore, we can look at the rankings as containing correctly annotated examples of language use. However, the approach employed in this study does not exclude the scenario in which a deprel obtains a low LISCA score because of a lack of similar constructions in the treebank. We explored this idea both at the deprel and at the sentence level, as described below.

Concerning the deprel-level analysis, we tested the accuracy of a parser for the deprels in the three portions of the LISCA rankings. To this aim, we parsed each PUD treebank using UDPipe (Straka et al., 2016), relying on the k-fold approach used to train LISCA: we split each PUD into 4 portions of 250 sentences each, trained UDPipe with 3/4 of the portions and parsed the remaining portion. Then, we checked whether deprels were parsed accurately. Again, we excluded function words from this analysis to improve cross-language comparability and avoid biased results, as function words are usually more accurately parsed than content words. Based on the obtained results, we observed that wrongly parsed deprels mainly concentrate in the bottom bins for all languages. This suggests that there might be a relationship between low LISCA scores and underrepresented phenomena.

For the sentence-level analysis, we computed a sentence-level LISCA score for each sentence in all PUD treebanks as the arithmetic mean of the scores of the individual deprels belonging to the sentence. We then explored whether sentences with low average LISCA scores are also more difficult to parse than those with higher average LISCA scores. Having computed the sentence-level LISCA scores, we collected two test sets of 100 sentences each by grouping the sentences showing the highest and the lowest LISCA scores. Then, we trained UDPipe using the remaining 800 sentences of PUD. The performances of UDPipe on the test sets are reported in terms of Labelled Attachment Score (LAS).

Figure 3: Parsing accuracy (LAS) on sentences having high and low LISCA scores.

The results of this experiment are reported in Figure 3. We observe that the test sets composed of sentences characterised by the highest scores are more accurately parsed than the lower-score sets for all the languages involved. Differences between languages in terms of overall LAS, and between the two subgroups of sentences, will be further investigated in future work. Such results complement the deprel-level analysis: they suggest that the methodology could isolate difficult-to-parse sentences, and not only deprels, that could be employed to expand treebanks.

Treebank expansion is extremely valuable for low-resourced languages and small resources in general, as it allows including unseen examples in treebanks. Our results suggest that the sentence suites collected by grouping the sentences characterised by the lowest LISCA scores contain difficult-to-parse constructions, possibly underrepresented in PUD, that should be included in the treebank to improve its representativeness.

5 Conclusion

We proposed a novel workflow to adapt an existing approach for treebank exploration to small treebanks and low-resourced languages. The results of our analyses showed the effectiveness of the methodology in multiple scenarios. First, the adapted method allows obtaining reliable results on par with the original method workflow when performing linguistic explorations of the treebanks. Secondly, the results also show the potential of the method for automatically identifying underrepresented constructions in treebanks. The latter result paves the way for the automatic identification of the cases required to expand treebanks, which we plan to further investigate in future work.

Acknowledgments

We would like to sincerely thank the anonymous reviewers for their helpful comments.
References

Akshay Aggarwal. 2020. Consistency of Linguistic Annotation. Master's thesis, Univerzita Karlova (ÚFAL), Prague, Czechia, September. Thesis supervisor: Daniel Zeman.

Chiara Alzetta, Felice Dell'Orletta, Simonetta Montemagni, and Giulia Venturi. 2017. Dangerous Relations in Dependency Treebanks. In Proceedings of the 16th International Workshop on Treebanks and Linguistic Theories, pages 201–210, Prague, Czech Republic.

Chiara Alzetta, Felice Dell'Orletta, Simonetta Montemagni, and Giulia Venturi. 2019a. Inferring quantitative typological trends from multilingual treebanks. A case study. Lingue e Linguaggio, 18(2):209–242.

Chiara Alzetta, Felice Dell'Orletta, Simonetta Montemagni, and Giulia Venturi. 2019b. Inferring quantitative typological trends from multilingual treebanks. A case study. Lingue e Linguaggio, XVIII(2):209–242.

Chiara Alzetta, Felice Dell'Orletta, Simonetta Montemagni, Petya Osenova, Kiril Simov, and Giulia Venturi. 2020a. Quantitative Linguistic Investigations across Universal Dependencies treebanks. In Proceedings of the Seventh Italian Conference on Computational Linguistics (CLiC-it), Bologna (online), Italy, March.

Chiara Alzetta, Felice Dell'Orletta, Simonetta Montemagni, and Giulia Venturi. 2020b. Linguistically-driven Selection of Difficult-to-Parse Dependency Structures. IJCoL. Italian Journal of Computational Linguistics, 6(6-2):37–60.

Mark Anderson, Anders Søgaard, and Carlos Gómez-Rodríguez. 2021. Replicating and Extending "Because Their Treebanks Leak": Graph Isomorphism, Covariants, and Parser Performance. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 1090–1098.

Aleksandrs Berdicevskis, Çağrı Çöltekin, Katharina Ehret, Kilu von Prince, Daniel Ross, Bill Thompson, Chunxiao Yan, Vera Demberg, Gary Lupyan, Taraka Rama, et al. 2018. Using Universal Dependencies in cross-linguistic complexity research. In Proceedings of the Second Workshop on Universal Dependencies (UDW 2018), pages 8–17.

William Croft, Dawn Nordquist, Katherine Looney, and Michael Regan. 2017. Linguistic Typology meets Universal Dependencies. In Proceedings of the 15th International Workshop on Treebanks and Linguistic Theories (TLT15), CEUR Workshop Proceedings, pages 63–75.

Marie-Catherine de Marneffe, Christopher D. Manning, Joakim Nivre, and Daniel Zeman. 2021. Universal Dependencies. Computational Linguistics, 47(2):255–308.

Felice Dell'Orletta, Giulia Venturi, and Simonetta Montemagni. 2013. Linguistically-driven Selection of Correct Arcs for Dependency Parsing. Computación y Sistemas, 17(2):125–136.

Vera Demberg and Frank Keller. 2008. Data from eye-tracking corpora as evidence for theories of syntactic processing complexity. Cognition, 109(2):193–210.

Matthew S. Dryer and Martin Haspelmath, editors. 2013. WALS Online. Max Planck Institute for Evolutionary Anthropology, Leipzig.

Richard Futrell, Kyle Mahowald, and Edward Gibson. 2015. Large-scale evidence of dependency length minimization in 37 languages. Proceedings of the National Academy of Sciences, 112(33):10336–10341.

Michael A. Hedderich, Lukas Lange, Heike Adel, Jannik Strötgen, and Dietrich Klakow. 2021. A Survey on Recent Approaches for Natural Language Processing in Low-Resource Scenarios. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2545–2568, Online, June. Association for Computational Linguistics.

Jingyang Jiang and Haitao Liu. 2018. Quantitative Analysis of Dependency Structures, volume 72. Walter de Gruyter GmbH & Co KG.

Ryan McDonald and Joakim Nivre. 2011. Analyzing and integrating dependency parsers. Computational Linguistics, 37(1):197–230.

Joakim Nivre. 2015. Towards a universal grammar for natural language processing. In International Conference on Intelligent Text Processing and Computational Linguistics, pages 3–16. Springer.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318.

Anders Søgaard. 2020. Some Languages Seem Easier to Parse Because Their Treebanks Leak. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2765–2770.

Milan Straka, Jan Hajic, and Jana Straková. 2016. UDPipe: trainable pipeline for processing CoNLL-U files performing tokenization, morphological analysis, POS tagging and parsing. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), pages 4290–4297.

Hamed Taherdoost. 2016. Sampling methods in research methodology; how to choose a sampling technique for research. How to Choose a Sampling Technique for Research (April 10, 2016).

David Temperley. 2007. Minimization of dependency length in written English. Cognition, 105(2):300–333.

Clara Vania, Yova Kementchedjhieva, Anders Søgaard, and Adam Lopez. 2019. A systematic comparison of methods for low-resource dependency parsing on genuinely low-resource languages. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 1105–1116.

Ekaterina Vylomova, Edoardo M. Ponti, Eitan Grossman, Arya D. McCarthy, Yevgeni Berzak, Haim Dubossarsky, Ivan Vulić, Roi Reichart, Anna Korhonen, and Ryan Cotterell, editors. 2020. Proceedings of the Second Workshop on Computational Research in Linguistic Typology.

Himanshu Yadav, Ashwini Vaidya, Vishakha Shukla, and Samar Husain. 2020. Word Order Typology Interacts With Linguistic Complexity: A Cross-Linguistic Corpus Study. Cognitive Science, 44(4):e12822.

Xiang Yu, Agnieszka Falenska, and Jonas Kuhn. 2019. Dependency length minimization vs. word order constraints: an empirical study on 55 treebanks. In Proceedings of the First Workshop on Quantitative Syntax (Quasy, SyntaxFest 2019), pages 89–97.

Daniel Zeman, Martin Popel, Milan Straka, Jan Hajic, Joakim Nivre, Filip Ginter, Juhani Luotolahti, Sampo Pyysalo, Slav Petrov, Martin Potthast, Francis Tyers, Elena Badmaeva, Memduh Gokirmak,

Daniel Zeman, Jan Hajič, Martin Popel, Martin Potthast, Milan Straka, Filip Ginter, Joakim Nivre, and Slav Petrov. 2018. CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies. In Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pages 1–21, Brussels, Belgium, October. Association for Computational Linguistics.

Daniel Zeman, Joakim Nivre, Mitchell Abrams, Elia Ackermann, Noëmi Aepli, Hamid Aghaei, Željko Agić, Amir Ahmadi, Lars Ahrenberg, Chika Kennedy Ajede, Gabrielė Aleksandravičiūtė, Ika Alfina, Lene Antonsen, Katya Aplonova, Angelina Aquino, Carolina Aragon, Maria Jesus Aranzabe, Bilge Nas Arıcan, Hórunn Arnardóttir, Gashaw Arutie, Jessica Naraiswari Arwidarasti, Masayuki Asahara, Deniz Baran Aslan, Luma Ateyah, Furkan Atmaca, Mohammed Attia, Aitziber Atutxa, Liesbeth Augustinus, Elena Badmaeva, Keerthana Balasubramani, Miguel Ballesteros, Esha Banerjee, Sebastian Bank, Verginica Barbu Mititelu, Starkaður Barkarson, Victoria Basmov, Colin Batchelor, John Bauer, Seyyit Talha Bedir, Kepa Bengoetxea, Gözde Berk, Yevgeni Berzak, Irshad Ahmad Bhat, Riyaz Ahmad Bhat, Erica Biagetti, Eckhard Bick, Agnė Bielinskienė, Kristín Bjarnadóttir, Rogier Blokland, Victoria Bobicev, Loïc Boizou, Emanuel Borges Völker, Carl Börstell, Cristina Bosco, Gosse Bouma, Sam Bowman, Adriane Boyd, Anouck Braggaar, Kristina Brokaitė, Aljoscha Burchardt, Marie Candito, Bernard Caron, Gauthier Caron, Lauren Cassidy, Tatiana Cavalcanti, Gülşen Cebiroğlu Eryiğit, Flavio Massimiliano Cecchini, Giuseppe G. A. Celano, Slavomír Čéplö, Neslihan Cesur, Savas Cetin, Özlem Çetinoğlu, Fabricio Chalub, Shweta Chauhan, Ethan Chi, Taishi Chika, Yongseok Cho, Jinho Choi, Jayeol Chun, Alessandra T. Cignarella, Silvie Cinková, Aurélie Collomb, Çağrı Çöltekin, Miriam Con-
Anna Nedoluzhko, Silvie Cinkova, Jan Hajic jr., nor, Marine Courtin, Mihaela Cristescu, Phile-
Jaroslava Hlavacova, Václava Kettnerová, Zdenka mon. Daniel, Elizabeth Davidson, Marie-Catherine
Uresova, Jenna Kanerva, Stina Ojala, Anna Mis- de Marneffe, Valeria de Paiva, Mehmet Oguz De-
silä, Christopher D. Manning, Sebastian Schuster, rin, Elvis de Souza, Arantza Diaz de Ilarraza,
Siva Reddy, Dima Taji, Nizar Habash, Herman Le- Carly Dickerson, Arawinda Dinakaramani, Elisa
ung, Marie-Catherine de Marneffe, Manuela San- Di Nuovo, Bamba Dione, Peter Dirix, Kaja Do-
guinetti, Maria Simi, Hiroshi Kanayama, Valeria de- brovoljc, Timothy Dozat, Kira Droganova, Puneet
Paiva, Kira Droganova, Héctor Martínez Alonso, Dwivedi, Hanne Eckhoff, Sandra Eiche, Marhaba
Çağrı Çöltekin, Umut Sulubacak, Hans Uszkor- Eli, Ali Elkahky, Binyam Ephrem, Olga Erina,
eit, Vivien Macketanz, Aljoscha Burchardt, Kim Tomaž Erjavec, Aline Etienne, Wograine Evelyn,
Harris, Katrin Marheinecke, Georg Rehm, Tolga Sidney Facundes, Richárd Farkas, Marília Fer-
Kayadelen, Mohammed Attia, Ali Elkahky, Zhuoran nanda, Hector Fernandez Alcalde, Jennifer Fos-
Yu, Emily Pitler, Saran Lertpradit, Michael Mandl, ter, Cláudia Freitas, Kazunori Fujita, Katarína Gaj-
Jesse Kirchner, Hector Fernandez Alcalde, Jana Str- došová, Daniel Galbraith, Marcos Garcia, Moa Gär-
nadová, Esha Banerjee, Ruli Manurung, Antonio denfors, Sebastian Garza, Fabrício Ferraz Gerardi,
Stella, Atsuko Shimada, Sookyoung Kwak, Gustavo Kim Gerdes, Filip Ginter, Gustavo Godoy, Iakes
Mendonca, Tatiana Lando, Rattima Nitisaroj, and Goenaga, Koldo Gojenola, Memduh Gökırmak,
Josie Li. 2017. Conll 2017 shared task: Multilin- Yoav Goldberg, Xavier Gómez Guinovart, Berta
gual parsing from raw text to universal dependen- González Saavedra, Bernadeta Griciūtė, Matias Gri-
cies. In Proceedings of the CoNLL 2017 Shared oni, Loïc Grobol, Normunds Grūzı̄tis, Bruno Guil-
Task: Multilingual Parsing from Raw Text to Univer- laume, Céline Guillot-Barbance, Tunga Güngör,
sal Dependencies, pages 1–19, Vancouver, Canada, Nizar Habash, Hinrik Hafsteinsson, Jan Hajič, Jan
August. Association for Computational Linguistics. Hajič jr., Mika Hämäläinen, Linh Hà Mỹ, Na-
Rae Han, Muhammad Yudistira Hanifmuti, Sam ganathan Ramasamy, Carlos Ramisch, Fam Rashel,
Hardwick, Kim Harris, Dag Haug, Johannes Hei- Mohammad Sadegh Rasooli, Vinit Ravishankar,
necke, Oliver Hellwig, Felix Hennig, Barbora Livy Real, Petru Rebeja, Siva Reddy, Georg Rehm,
Hladká, Jaroslava Hlaváčová, Florinel Hociung, Pet- Ivan Riabov, Michael Rießler, Erika Rimkutė,
ter Hohle, Eva Huber, Jena Hwang, Takumi Ikeda, Larissa Rinaldi, Laura Rituma, Luisa Rocha, Eiríkur
Anton Karl Ingason, Radu Ion, Elena Irimia, O.lájídé Rögnvaldsson, Mykhailo Romanenko, Rudolf Rosa,
Ishola, Kaoru Ito, Tomáš Jelínek, Apoorva Jha, Valentin Ros, ca, Davide Rovati, Olga Rudina, Jack
Anders Johannsen, Hildur Jónsdóttir, Fredrik Jør- Rueter, Kristján Rúnarsson, Shoval Sadde, Pegah
gensen, Markus Juutinen, Sarveswaran K, Hüner Safari, Benoît Sagot, Aleksi Sahala, Shadi Saleh,
Kaşıkara, Andre Kaasen, Nadezhda Kabaeva, Syl- Alessio Salomoni, Tanja Samardžić, Stephanie Sam-
vain Kahane, Hiroshi Kanayama, Jenna Kanerva, son, Manuela Sanguinetti, Ezgi Sanıyar, Dage Särg,
Neslihan Kara, Boris Katz, Tolga Kayadelen, Jes- Baiba Saulı̄te, Yanin Sawanakunanon, Shefali Sax-
sica Kenney, Václava Kettnerová, Jesse Kirchner, ena, Kevin Scannell, Salvatore Scarlata, Nathan
Elena Klementieva, Arne Köhn, Abdullatif Kök- Schneider, Sebastian Schuster, Lane Schwartz,
sal, Kamil Kopacewicz, Timo Korkiakangas, Na- Djamé Seddah, Wolfgang Seeker, Mojgan Ser-
talia Kotsyba, Jolanta Kovalevskaitė, Simon Krek, aji, Mo Shen, Atsuko Shimada, Hiroyuki Shi-
Parameswari Krishnamurthy, Oğuzhan Kuyrukçu, rasu, Yana Shishkina, Muh Shohibussirri, Dmitry
Aslı Kuzgun, Sookyoung Kwak, Veronika Laippala, Sichinava, Janine Siewert, Einar Freyr Sigurðs-
Lucia Lam, Lorenzo Lambertino, Tatiana Lando, son, Aline Silveira, Natalia Silveira, Maria Simi,
Septina Dian Larasati, Alexei Lavrentiev, John Lee, Radu Simionescu, Katalin Simkó, Mária Šimková,
Phương Lê Hồng, Alessandro Lenci, Saran Lertpra- Kiril Simov, Maria Skachedubova, Aaron Smith,
dit, Herman Leung, Maria Levina, Cheuk Ying Li, Isabela Soares-Bastos, Carolyn Spadine, Rachele
Josie Li, Keying Li, Yuan Li, KyungTae Lim, Bruna Steingrímsson, Antonio Stella,
Sprugnoli, Steinhór
Lima Padovani, Krister Lindén, Nikola Ljubešić, Milan Straka, Emmett Strickland, Jana Strnadová,
Olga Loginova, Andry Luthfi, Mikko Luukko, Alane Suhr, Yogi Lesmana Sulestio, Umut Su-
Olga Lyashevskaya, Teresa Lynn, Vivien Macke- lubacak, Shingo Suzuki, Zsolt Szántó, Dima Taji,
tanz, Aibek Makazhanov, Michael Mandl, Christo- Yuta Takahashi, Fabio Tamburini, Mary Ann C.
pher Manning, Ruli Manurung, Büşra Marşan, Tan, Takaaki Tanaka, Samson Tella, Isabelle Tellier,
Cătălina Mărănduc, David Mareček, Katrin Marhei- Marinella Testori, Guillaume Thomas, Liisi Torga,
necke, Héctor Martínez Alonso, André Mar- Marsida Toska, Trond Trosterud, Anna Trukhina,
tins, Jan Mašek, Hiroshi Matsuda, Yuji Mat- Reut Tsarfaty, Utku Türk, Francis Tyers, Sumire Ue-
sumoto, Alessandro Mazzei, Ryan McDonald, Sarah matsu, Roman Untilov, Zdeňka Urešová, Larraitz
McGuinness, Gustavo Mendonça, Niko Miekka, Uria, Hans Uszkoreit, Andrius Utka, Sowmya Va-
Karina Mischenkova, Margarita Misirpashayeva, jjala, Rob van der Goot, Martine Vanhove, Daniel
Anna Missilä, Cătălin Mititelu, Maria Mitrofan, van Niekerk, Gertjan van Noord, Viktor Varga,
Yusuke Miyao, AmirHossein Mojiri Foroushani, Ju- Eric Villemonte de la Clergerie, Veronika Vincze,
dit Molnár, Amirsaeid Moloodi, Simonetta Mon- Natalia Vlasova, Aya Wakasa, Joel C. Wallen-
temagni, Amir More, Laura Moreno Romero, berg, Lars Wallin, Abigail Walsh, Jing Xian Wang,
Giovanni Moretti, Keiko Sophie Mori, Shinsuke Jonathan North Washington, Maximilan Wendt,
Mori, Tomohiko Morioka, Shigeki Moro, Bjar- Paul Widmer, Seyi Williams, Mats Wirén, Chris-
tur Mortensen, Bohdan Moskalevskyi, Kadri Muis- tian Wittern, Tsegay Woldemariam, Tak-sum Wong,
chnek, Robert Munro, Yugo Murawaki, Kaili Alina Wróblewska, Mary Yako, Kayo Yamashita,
Müürisep, Pinkey Nainwani, Mariam Nakhlé, Naoki Yamazaki, Chunxiao Yan, Koichi Yasuoka,
Juan Ignacio Navarro Horñiacek, Anna Nedoluzhko, Marat M. Yavrumyan, Arife Betül Yenice, Ol-
Gunta Nešpore-Bērzkalne, Manuela Nevaci, Lương cay Taner Yıldız, Zhuoran Yu, Zdeněk Žabokrtský,
Nguyễn Thi., Huyền Nguyễn Thi. Minh, Yoshihiro Shorouq Zahra, Amir Zeldes, Hanzhi Zhu, Anna
Nikaido, Vitaly Nikolaev, Rattima Nitisaroj, Alireza Zhuravleva, and Rayan Ziane. 2021. Universal
Nourian, Hanna Nurmi, Stina Ojala, Atul Kr. Ojha, dependencies 2.8.1. LINDAT/CLARIAH-CZ dig-
Adédayo. Olúòkun, Mai Omura, Emeka Onwueg- ital library at the Institute of Formal and Applied
buzia, Petya Osenova, Robert Östling, Lilja Øvre- Linguistics (ÚFAL), Faculty of Mathematics and
lid, Şaziye Betül Özateş, Merve Özçelik, Arzu- Physics, Charles University.
can Özgür, Balkız Öztürk Başaran, Hyunji Hay-
ley Park, Niko Partanen, Elena Pascual, Marco
Passarotti, Agnieszka Patejuk, Guilherme Paulino-
Passos, Angelika Peljak-Łapińska, Siyao Peng,
Cenel-Augusto Perez, Natalia Perkova, Guy Per-
rier, Slav Petrov, Daria Petrova, Jason Phelan, Jussi
Piitulainen, Tommi A Pirinen, Emily Pitler, Bar-
bara Plank, Thierry Poibeau, Larisa Ponomareva,
Martin Popel, Lauma Pretkalnin, a, Sophie Prévost,
Prokopis Prokopidis, Adam Przepiórkowski, Tiina
Puolakainen, Sampo Pyysalo, Peng Qi, Andriela
Rääbis, Alexandre Rademaker, Taraka Rama, Lo-