<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Quantitative Linguistic Investigations across Universal Dependencies Treebanks</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Chiara Alzetta</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Sofia University</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2019</year>
      </pub-date>
      <abstract>
        <p>The paper illustrates a case study aimed at identifying cross-lingual quantitative trends in the distribution of dependency relations in treebanks for typologically different languages. Preliminary results show interesting differences rooted either in language-specific peculiarities or cross-lingual annotation inconsistencies, with a potential impact on different application scenarios.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction and Motivation</title>
      <p>
        The identification of cross-lingual quantitative
trends in the distribution of dependency relations
in “gold” treebanks is increasingly attracting the
interest of the computational linguistics
community for different purposes, as testified e.g. by a
recently published edited volume on the
quantitative analysis of dependency structures
        <xref ref-type="bibr" rid="ref14">(Jiang
and Liu, 2018)</xref>
        or pilot initiatives such as the
first edition of the workshop “Quantitative
Syntax 2019”<fn><p>https://www.aclweb.org/anthology/W19-79.pdf</p></fn>. Among possible applications, it is
worth mentioning studies aimed at acquiring
typological evidence to be integrated in
multilingual NLP algorithms (see
        <xref ref-type="bibr" rid="ref18">Ponti et al. (2018)</xref>
        for a
survey and the workshop “Typology for Polyglot
NLP”<fn><p>https://typology-and-nlp.github.io/</p></fn>), or at detecting annotation inconsistencies
to improve the quality of treebanks (see
        <xref ref-type="bibr" rid="ref11 ref9">Dickinson, 2015; de Marneffe et al., 2017</xref>,
        to mention
only a few). While the latter is a well-established
research topic, although one with many open
issues, the automatic acquisition of typological
information is still in its infancy, so automatic
strategies to extract such information from corpora are
needed
        <xref ref-type="bibr" rid="ref8 ref5">(Cotterell and Eisner, 2017; Bjerva and
Augenstein, 2018)</xref>
        .
      </p>
      <p>Copyright © 2020 for this paper by its authors. Use
permitted under Creative Commons License Attribution 4.0
International (CC BY 4.0).</p>
      <p>
        Multilingual resources such as the
dependency treebanks developed within the
Universal Dependencies (UD) project, thanks to their
cross-linguistically consistent syntactic annotation
        <xref ref-type="bibr" rid="ref17">(Nivre, 2015)</xref>
        , fostered the development of
automatic strategies to extract cross-lingual
similarities and differences in shared constructions from
corpora
        <xref ref-type="bibr" rid="ref15 ref6">(Murawaki, 2017; Bjerva et al., 2019)</xref>
        .
Within this line of research, the paper describes
a methodology for comparing treebanks of
typologically different languages, with the final aim
of detecting and quantifying similarities and
differences in multilingual treebanks from
a twofold perspective: language-specific
peculiarities vs cross-lingual annotation
inconsistencies. To this end, we used LISCA (LInguiStically-driven
Selection of Correct Arcs)
        <xref ref-type="bibr" rid="ref10">(Dell’Orletta et
al., 2013)</xref>
        , an algorithm which has been
successfully applied in different scenarios, to both
the output of dependency parsers and gold
treebanks. In the first case, the score returned by
LISCA was meant to identify unreliable
automatically produced dependency relations
        <xref ref-type="bibr" rid="ref10">(Dell’Orletta
et al., 2013)</xref>
        . When applied to gold
annotations, LISCA was used to detect degrees of
markedness of syntactic constructions in
manually annotated corpora from a monolingual
perspective (Tusa et al., 2016), or to acquire
quantitative typological evidence from a multilingual
perspective
        <xref ref-type="bibr" rid="ref3">(Alzetta et al., 2018b)</xref>
        . Last but not least,
it was also exploited to identify anomalous
annotations (ranging from annotation inconsistencies to
errors) from a monolingual perspective in gold
treebanks
        <xref ref-type="bibr" rid="ref2">(Alzetta et al., 2018a)</xref>
        .
      </p>
      <p>The methodology exploited for the present work
(described in Section 2) was tested in a case study
carried out on four Indo-European languages
belonging to three different genera (according
to the WALS classification, Dryer and Haspelmath
(2013)): Bulgarian (Slavic, BUL), English
(Germanic, ENG), Italian and Spanish (Romance, ITA
and SPA). UD treebanks<fn><p>https://universaldependencies.org/</p></fn> constitute an ideal test
bed for our analysis since, sharing the same
annotation scheme, they allow the investigation of
cross-lingual similarities and differences in shared
constructions. Besides similarities connected with
the UD annotation strategy aimed at maximising
parallelism across languages, the results in Section
4 reflect shared, possibly “universal”, features of
languages. Differences, in turn, can either
reflect typologically relevant language peculiarities
or highlight inconsistencies in the application of
the shared annotation scheme. The paper focuses
on both aspects. Section 5 concludes the paper
by discussing our findings and future directions of
research.</p>
      <p>Contribution. The present contribution has two
main goals: we aim to show how the
methodology can be used 1) to acquire quantitative evidence
of cross-linguistically shared properties, and 2) to
highlight divergences due either to language
idiosyncrasies or to annotation inconsistencies across
treebanks.</p>
    </sec>
    <sec id="sec-2">
      <title>Method</title>
      <p>As shown in Figure 1, our methodology for
exploring multilingual treebanks is articulated in the
following two steps.</p>
      <p>I) LISCA Analysis. The LISCA algorithm
operates in two steps: 1) it collects statistics about
a set of linguistically motivated features extracted
from an automatically dependency-parsed corpus
(referred to as the Reference Corpus) to build a
statistical model (SM) of the language; 2) it uses
the obtained SM to assign a score to each
dependency relation (DR) instance in a Target Corpus, where a DR is defined as a triple
(d, h, t) consisting of the dependent d, the head h and the type t of the dependency linking
d to h. Borrowing a metaphor
from Jakobson (1973), we can look at the SM as
encoding the DNA of the language being analysed.
Note, in fact, that the features considered by the
LISCA algorithm to build the SM cover, for each
DR instance, a wide variety of factors, both local
and global. Local features include e.g. the
distance in terms of tokens between d and h, the
associative strength linking the grammatical categories
involved in the relation (i.e. POSd and POSh), the
POS of the head governor, the type of dependency
connecting d to h, and the relative linear order of
d and h in the sentence. Global features, instead,
are aimed at locating each DR within the overall
sentence structure, and include e.g. the distance
of d from the root of the dependency tree or from
the closest or most distant leaf node, and the
number of “brother” and “children” nodes of d,
occurring respectively to its right or left in the linear
sequence of words of the sentence. In this case
study, LISCA has been used in its delexicalized
version in order to abstract away from variations
resulting from lexical effects, thus guaranteeing
cross-lingual comparability of results. The output
of LISCA consists of the list of all DRs in the
Target Corpus ranked by decreasing score.</p>
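      <p>As an illustration only, the following Python sketch computes a handful of the delexicalized features listed above (token distance, relative linear order, POS pair, depth from the root, number of children) for each DR of a toy CoNLL-U sentence. It is not the LISCA implementation: the function names, the feature dictionary and the toy sentence with its annotation are assumptions made for this example, and the actual feature set and scoring are those described by Dell’Orletta et al. (2013).</p>
      <preformat>
```python
# Illustrative sketch only: a few delexicalized DR features, not the LISCA code.
# All names below (read_tokens, dr_features, ROWS) are hypothetical.


def read_tokens(conllu_sentence):
    """Return (id, upos, head, deprel) tuples for the plain word lines."""
    tokens = []
    for line in conllu_sentence.strip().splitlines():
        cols = line.split("\t")
        if line.startswith("#") or not cols[0].isdigit():
            continue  # skip comments, multiword ranges and empty nodes
        tokens.append((int(cols[0]), cols[3], int(cols[6]), cols[7]))
    return tokens


def dr_features(tokens):
    """Compute some local and global features for every non-root DR."""
    upos = {i: pos for i, pos, _, _ in tokens}
    head_of = {i: head for i, _, head, _ in tokens}
    features = []
    for i, pos, head, deprel in tokens:
        if head == 0:
            continue  # the root has no head token to measure against
        depth, node = 0, i
        while head_of[node] != 0:  # climb to the root, counting arcs
            node = head_of[node]
            depth += 1
        features.append({
            "deprel": deprel,
            "pos_pair": (pos, upos[head]),      # POS of d and POS of h
            "distance": abs(i - head),          # distance in tokens
            "dep_precedes_head": head > i,      # relative linear order
            "depth_from_root": depth,           # global feature
            "n_children": sum(1 for h in head_of.values() if h == i),
        })
    return features


# Toy sentence with an assumed UD-style annotation: "The woman entered the room"
ROWS = [
    ("1", "The", "the", "DET", "_", "_", "2", "det", "_", "_"),
    ("2", "woman", "woman", "NOUN", "_", "_", "3", "nsubj", "_", "_"),
    ("3", "entered", "enter", "VERB", "_", "_", "0", "root", "_", "_"),
    ("4", "the", "the", "DET", "_", "_", "5", "det", "_", "_"),
    ("5", "room", "room", "NOUN", "_", "_", "3", "obj", "_", "_"),
]
SENTENCE = "\n".join("\t".join(row) for row in ROWS)
FEATURES = dr_features(read_tokens(SENTENCE))
for feats in FEATURES:
    print(feats)
```
      </preformat>
      <p>Word forms and lemmas are read but never used, which mirrors the delexicalized setting; a statistical model would then be estimated from such feature values over the Reference Corpus.</p>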
      <p>The LISCA score is a context-sensitive and
frequency-based measure reflecting the degree
of similarity of the “linguistic environments” in
which a given DR occurs in the Reference and
Target corpora: it encodes the probability of
observing a DR instance occurring in a specific
context on the basis of the Statistical Model
built from the Reference Corpus. In
more abstract terms, the LISCA score can be seen
as reflecting the prototypicality degree of a
specific linguistic structure: whereas higher LISCA
scores identify DR instances appearing in
“typical” (more frequent and likely) contexts with
respect to the statistics acquired from the
Reference Corpus, lower scores identify less common or
even atypical DR instances of the Target Corpus.
From a multilingual perspective, the comparison
of the ranked DR lists obtained from corpora of
different languages can shed light on similarities
and differences at the linguistic and/or annotation
levels. To carry out this comparative analysis, in this
study the ranked list of DRs has been split into 20
intervals of equal size, henceforth “bins” (plus a
further bin for the remaining DRs): the first bins
contain DRs presenting a high LISCA score and,
conversely, the last bins contain DRs associated
with low LISCA scores.</p>
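      <p>The binning step can be sketched in Python as follows. This is a minimal sketch under one assumption about the text above: the 20 regular bins each hold the integer quotient of the list length by 20, and whatever is left over at the bottom of the ranking goes into an extra 21st bin. The function name and the toy scores are hypothetical.</p>
      <preformat>
```python
# Illustrative sketch: split a ranked DR list into 20 equal bins plus a remainder bin.
# The handling of the remainder is an assumption, not taken from the LISCA code.


def assign_bins(scores, n_bins=20):
    """Return, for each score, its 1-based bin; bin 1 holds the highest scores."""
    size = len(scores) // n_bins  # number of instances per regular bin
    if size == 0:
        raise ValueError("need at least n_bins scored instances")
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    bins = [0] * len(scores)
    for rank, i in enumerate(ranked):
        # ranks beyond the regular bins fall into the extra bin n_bins + 1
        bins[i] = min(rank // size, n_bins) + 1
    return bins


SCORES = [k / 100 for k in range(45)]  # toy scores, lowest first
BINS = assign_bins(SCORES)
print(BINS[44], BINS[0], BINS.count(21))  # prints: 1 21 5
```
      </preformat>
      <p>With 45 toy instances each regular bin holds two DRs and the five lowest-scored instances end up in bin 21; when the list length is a multiple of 20 the extra bin stays empty.</p>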
      <sec id="sec-2-1">
        <title>II) Ranking Exploration</title>
        <p>
          We exploited the CLaRK
          system
          <xref ref-type="bibr" rid="ref20">(Simov et al., 2004)</xref>
          to identify and
compare quantitative trends in the LISCA rankings.
The CLaRK workflow is the following: first,
each Target Corpus is converted from the
CoNLL-U format<fn><p>http://universaldependencies.org/format.html</p></fn> into XML; then the XPath
language is used to select the nodes (sentences or
tokens) with the required properties. In this way we
can define different configurations and check the
distribution of the node characteristics along the
DR rankings.
        </p>
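        <p>The conversion-and-query workflow can be approximated with the Python standard library, as in the sketch below. It does not use CLaRK: the XML layout (a corpus element containing s and tok elements with id, form, upos, head and deprel attributes) and the toy sentences with their annotation are assumptions made for the example, and xml.etree.ElementTree supports only a subset of XPath.</p>
        <preformat>
```python
# Illustrative sketch: CoNLL-U to a toy XML layout, then XPath-style selection.
# Element and attribute names are assumptions, not the CLaRK schema.
import xml.etree.ElementTree as ET


def conllu_to_xml(conllu_text):
    """Build a corpus element with one s element per sentence and one tok per word."""
    corpus = ET.Element("corpus")
    for block in conllu_text.strip().split("\n\n"):
        sentence = ET.SubElement(corpus, "s")
        for line in block.splitlines():
            cols = line.split("\t")
            if line.startswith("#") or not cols[0].isdigit():
                continue  # skip comments, multiword ranges and empty nodes
            ET.SubElement(sentence, "tok", id=cols[0], form=cols[1],
                          upos=cols[3], head=cols[6], deprel=cols[7])
    return corpus


def to_conllu(rows):
    """Join toy rows into tab-separated CoNLL-U lines."""
    return "\n".join("\t".join(row) for row in rows)


SENT_1 = [
    ("1", "She", "she", "PRON", "_", "_", "2", "nsubj", "_", "_"),
    ("2", "gave", "give", "VERB", "_", "_", "0", "root", "_", "_"),
    ("3", "us", "we", "PRON", "_", "_", "2", "iobj", "_", "_"),
    ("4", "trouble", "trouble", "NOUN", "_", "_", "2", "obj", "_", "_"),
]
SENT_2 = [
    ("1", "The", "the", "DET", "_", "_", "2", "det", "_", "_"),
    ("2", "woman", "woman", "NOUN", "_", "_", "3", "nsubj", "_", "_"),
    ("3", "entered", "enter", "VERB", "_", "_", "0", "root", "_", "_"),
    ("4", "the", "the", "DET", "_", "_", "5", "det", "_", "_"),
    ("5", "room", "room", "NOUN", "_", "_", "3", "obj", "_", "_"),
]
CORPUS = conllu_to_xml(to_conllu(SENT_1) + "\n\n" + to_conllu(SENT_2))

# Select tokens by property: indirect objects realised as pronouns, and determiners.
PRON_IOBJ = CORPUS.findall(".//tok[@deprel='iobj'][@upos='PRON']")
DETS = CORPUS.findall(".//tok[@deprel='det']")
print([tok.get("form") for tok in PRON_IOBJ], len(DETS))
```
        </preformat>
        <p>Adding the LISCA bin of each DR as a further attribute on the token elements would allow the same kind of query to be restricted to a portion of the ranking.</p>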
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Data</title>
      <p>For each language taken into account, two
linguistically annotated corpora have been used: a large
Reference Corpus and a Target Corpus.</p>
      <p>
        Each Reference Corpus consists of a
monolingual corpus of texts from the news and Wikipedia
domains of around 40 million tokens,
constituting a set of examples large enough to reflect
the actual distribution of phenomena in the
specific language. Reference corpora were
morphosyntactically annotated and dependency parsed by
the UDPipe pipeline
        <xref ref-type="bibr" rid="ref22">(Straka et al., 2016)</xref>
        trained
on the Universal Dependencies treebanks, version
2.2
        <xref ref-type="bibr" rid="ref16">(Nivre et al., 2017)</xref>
        .
      </p>
      <p>
        Target corpora correspond here to manually
validated (“gold”) Universal Dependencies
treebanks (v2.2). Specifically, we considered the
following UD treebanks:
i) English Web Treebank (254,830 tokens and
16,622 sentences)
        <xref ref-type="bibr" rid="ref19">(Silveira et al., 2014)</xref>
        ;
ii) Italian Stanford Dependency Treebank
(278,429 tokens and 14,167 sentences)
        <xref ref-type="bibr" rid="ref7">(Bosco et
al., 2013)</xref>
        ;
iii) Spanish UD treebank (547,680 tokens and
17,680 sentences)
        <xref ref-type="bibr" rid="ref1">(Alonso and Zeman, 2016)</xref>
        ;
iv) UD_Bulgarian-BTB (156,149 tokens and
11,138 sentences)
        <xref ref-type="bibr" rid="ref21">(Simov et al., 2005)</xref>
        .
      </p>
    </sec>
    <sec id="sec-4">
      <title>Results</title>
      <p>Results are analysed from a twofold perspective,
focusing on the distribution across the bins of
different DR types and structures.</p>
      <sec id="sec-4-1">
        <title>Ranking of Dependencies</title>
        <p>As pointed out above, higher LISCA scores are
assigned to DRs that show a linguistic context highly
typical for the language, whereas low scores are
associated with atypical (or simply less typical)
syntactic structures; (un)typicality is assessed here
with respect to the statistics acquired from the
Reference Corpus.</p>
        <p>
          As a first step of our comparative analysis, for
each language we focused on the distribution of
individual DRs across the 20 LISCA bins.
Figure 2 reports the median bin of occurrence for all
29 shared DRs in the ranking of each language.
The median bin was selected by sorting all
instances of a given DR on the basis of the
associated LISCA score and by identifying the
median element of the ranked list: its bin of
occurrence was taken as representative of the relation.
Top and bottom relations (respectively at the
extreme left and right of the graph in Figure 2) in
language-specific rankings show interesting similarities: if
on the one hand DRs involving function words
(e.g. case, det, aux(:pass)) are associated
with higher LISCA scores for all languages, on
the other hand special or “loose” DRs such as
orphan and parataxis or clausal subjects and
adverbial clauses (csubj(:pass), advcl) all
occur in the last bins, representing relations with
more variable contexts across all languages.
Another cross-language parallelism concerns the
relative rankings of subsets of DRs: clausal
complements with obligatory control (xcomp) are
assigned a higher score with respect to the wider
class of clausal complements without it (ccomp);
the direct object relation (obj) precedes in the
ranking the oblique argument/modifier (obl); and
the nominal subject (nsubj) always precedes its
clausal counterpart (csubj). Interestingly,
the frequency of a DR seems to play a
minor role in determining its position
in the LISCA ranking: consider, for instance,
the punct relation, which is a highly frequent
DR (covering around 11% of DRs in all four
languages) but is nevertheless placed in the
middle part of the ranking for all languages. Seen
from this perspective, the LISCA ranking of
relations, which is heavily influenced by the
principles underlying the UD annotation scheme,
seems to reflect the parsing complexity of
relations
          <xref ref-type="bibr" rid="ref4">(Alzetta et al., 2020)</xref>
          , where DRs that are more complex
to parse are characterised by a higher
variability in their contexts of occurrence.
        </p>
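        <p>The median-bin computation described above can be sketched in Python as follows. The input triples are made-up toy values rather than results from the treebanks, the function name is hypothetical, and taking the lower of the two middle elements for an even number of instances is an assumption, since the text does not specify that case.</p>
        <preformat>
```python
# Illustrative sketch: median bin of occurrence for each dependency relation.
from collections import defaultdict


def median_bin_per_relation(instances):
    """instances: iterable of (deprel, lisca_score, bin) triples."""
    by_relation = defaultdict(list)
    for deprel, score, bin_no in instances:
        by_relation[deprel].append((score, bin_no))
    medians = {}
    for deprel, scored in by_relation.items():
        scored.sort(key=lambda pair: pair[0], reverse=True)  # rank by score
        middle = (len(scored) - 1) // 2  # lower middle element for even counts
        medians[deprel] = scored[middle][1]  # bin of the median instance
    return medians


# Toy values invented for the example
TOY = [
    ("det", 0.9, 1), ("det", 0.8, 1), ("det", 0.7, 2),
    ("advcl", 0.2, 18), ("advcl", 0.1, 19), ("advcl", 0.05, 20), ("advcl", 0.3, 15),
]
MEDIANS = median_bin_per_relation(TOY)
print(MEDIANS)  # prints: {'det': 1, 'advcl': 18}
```
        </preformat>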
        <p>Some interesting differences can also be
reported, originating either in a) language-specific
peculiarities or b) possibly inconsistent
annotations across languages. Concerning a), ENG
nominal subjects (nsubj, nsubj:pass) are ranked
significantly higher than in the other three
languages, all sharing the pro-drop and free word
order properties. Similarly, determiners (det) show the
same distribution for SPA, ENG and ITA in
contrast to BUL, where the definite article is
postposed and expressed morphologically, with
the exception of some pronouns functioning as
determiners, e.g. demonstratives. Here are two
examples for Bulgarian: the first shows
the morphologically expressed
postposed definite article (thus no explicit det
relation), while the second shows a
demonstrative pronoun (marked with the det
relation): (1) ‘Жената влезе в стаята’ (lit.
Woman-the entered room-the) and (2) ‘Тази
жена влезе в стаята’ (lit. This woman entered
room-the). Examples of type
(1) are about 10 times more frequent in the treebank than
examples of type (2). Thus,
nsubj nodes modified by an explicit determiner
word are rare in the Bulgarian treebank.</p>
        <p>With respect to b), there are interesting
examples, even among core UD DRs: this is the case of
indirect objects (iobj), whose annotation criteria
highly diverge across languages. The
dissimilarities might stem partially from the
language-specific annotation guidelines on what counts as a
second argument (iobj) vs an adjunct (obl).
A closer look at the data shows
that in ITA and ENG the iobj is typically
expressed by a PRON(oun), as in these two
examples: ITA: ‘ti (PRON) ho dato’ (lit. ‘I gave you’);
ENG: ‘causing us (PRON) trouble’. In ITA this
accounts for 100% of the cases and in ENG for 84%,
whereas in SPA and BUL this relation is expressed
by a pronoun in only 46.7% and 19% of the cases
respectively. In Spanish, for example, the iobj
relation is used also for NOUNs: in the
Spanish example ‘Obligaron al Gobierno (NOUN) a
comprar creditos’ (lit. Forced the Government to
buy credits) the noun is annotated as indirect
object of obligaron, whereas in Italian the
construction ‘Non ho dato soldi al presidente (NOUN)’
(lit. I didn’t give money to the president) the
noun is marked as obl relation. In Bulgarian
the iobj relation is used not only for marking
the dative pronouns, but also for marking head
NOUNs in PPs. The prevalence of this relation
on NOUNs is due to the following factors: (1)
the existence of long dative counterparts to short
dative pronouns that consist of a preposition and
a noun (‘Майката даде играчка на детето’)
(prep NOUN) (lit. Mother-the gave toy to
child-the-DAT); and (2) the marking of indirect
complements as indirect objects, while the obl
relation has been reserved for adjuncts (‘Те
продължават да участват в лотарията’) (non-dative
prep NOUN) (lit. They continue to participate in
lottery-the). This suggests that different
annotation criteria guide the assignment of the iobj DR,
possibly not all of them originating in peculiarities
of the language.</p>
        <p>Other interesting examples concern the
annotation of multi-word expressions and proper names
(fixed and flat), which are treated
differently across languages. For example, in BUL all
grammatically fixed multi-words, such as complex
prepositions (like с оглед на ‘with regard to’) or
conjunctions (like за да ‘in order to’), are treated
as fixed, while in Italian the annotation reflects
the underlying syntactic structure, as in the case
of, e.g., ‘a base di’ (lit. made of) and ‘in relazione
a’ (lit. in relation to).</p>
      </sec>
      <sec id="sec-4-2">
        <title>Distribution of Leaves</title>
        <p>For each language, we investigated the
distribution of DRs across the LISCA bins focusing on
DRs involving leaves as dependents (henceforth
leaves), as opposed to DRs whose dependent is not a leaf node
(henceforth non-leaves). Results of this
analysis are reported in Table 1. Despite minor
differences, all languages share a similar trend: leaves
are mostly ranked in the first 10 bins, where they
represent 91.52% of the DRs for Bulgarian,
95.56% for English, 98.27% for Italian and
91.76% for Spanish. Interestingly, the first 6, 6,
8 and 4 bins respectively for Bulgarian, English,
Italian and Spanish contain exclusively leaves. In
other words, leaves are typically associated with
higher LISCA scores: due to their smaller
context, they are characterised by higher processing
reliability. This is in line with the fact that DRs
involving function words, e.g. case, det, aux,
etc., typically occur in the first bins (see Figure
2). By contrast, the last 10 bins of all
languages mostly contain DRs not involving leaves
(68.28% BUL, 63.54% ENG, 69.33% ITA, 64.54%
SPA). As for the leaves in the second
half of the bins, they turned out to be typically
involved in particularly complex syntactic contexts,
such as long-distance dependencies, or to occur in
constructions that are not typical for that relation.</p>
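        <p>The leaf versus non-leaf breakdown reported above can be sketched in Python as follows. The sketch assumes that a DR counts as a leaf when its dependent governs no other token; the function names and the toy records are hypothetical and do not reproduce the figures of Table 1.</p>
        <preformat>
```python
# Illustrative sketch: share of leaf DRs in a range of LISCA bins.


def leaf_flags(heads):
    """heads[k] is the 1-based head of token k + 1 (0 marks the root)."""
    governors = set(heads)  # every token id that governs at least one token
    return [(k + 1) not in governors for k in range(len(heads))]


def leaf_share(records, first_bin, last_bin):
    """Percentage of leaf DRs among the DRs ranked in bins first_bin..last_bin."""
    wanted = range(first_bin, last_bin + 1)
    flags = [is_leaf for is_leaf, bin_no in records if bin_no in wanted]
    if not flags:
        raise ValueError("no DR falls in the requested bins")
    return 100.0 * sum(flags) / len(flags)


HEADS = [2, 3, 0, 5, 3]  # toy tree for "The woman entered the room"
LEAVES = leaf_flags(HEADS)
print(LEAVES)  # prints: [True, False, False, True, False]

# Toy (is_leaf, bin) records invented for the example
RECORDS = [(True, 1), (True, 2), (True, 3), (False, 9),
           (False, 12), (True, 15), (False, 18), (False, 20)]
TOP = leaf_share(RECORDS, 1, 10)
BOTTOM = leaf_share(RECORDS, 11, 20)
print(TOP, 100.0 - BOTTOM)  # prints: 75.0 75.0
```
        </preformat>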
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Conclusion</title>
      <p>In this paper we presented a method for studying the
distribution of DRs in gold treebanks, which was
tested in a case study carried out on four languages
belonging to three different genera. The
cross-lingual comparison of the LISCA-based ranking
of UD relations across the bins shows: on the one
hand, shared (possibly universal) trends,
concerning e.g. the similar distribution of dependencies
involving leaves or of long distance dependencies,
which are respectively concentrated at the top and
at the bottom of the LISCA ranking for each
language; on the other hand, recorded differences in
the ranking of relations can be explained in terms
of either language peculiarities (e.g. the pro-drop
property of BUL-ITA-SPA vs ENG, or the
surface realisation of definite determiners in BUL vs
ENG-ITA-SPA) or potential inconsistencies in the
application of the UD annotation scheme (see the
case of the indirect object relation). Both types of
results play a potentially key role in different
scenarios, going from typology-driven multilingual
NLP to the improvement of the cross-lingual
consistency of treebanks.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgements</title>
      <p>Thanks to the anonymous reviewers for their
helpful comments. This work was partially supported
by the Bulgarian National Interdisciplinary
Research e-Infrastructure for Resources and
Technologies in favor of the Bulgarian Language and
Cultural Heritage, part of the EU infrastructures
CLARIN and DARIAH – CLaDA-BG, Grant number […].</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <given-names>H. M.</given-names>
            <surname>Alonso</surname>
          </string-name>
          and
          <string-name>
            <given-names>D.</given-names>
            <surname>Zeman</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Universal Dependencies for the AnCora treebanks</article-title>
          .
          <source>Procesamiento del Lenguaje Natural</source>
          ,
          <volume>57</volume>
          :
          <fpage>91</fpage>
          -
          <lpage>98</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <given-names>C.</given-names>
            <surname>Alzetta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Dell'Orletta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Montemagni</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Venturi</surname>
          </string-name>
          . 2018a.
          <article-title>Dangerous relations in dependency treebanks</article-title>
          .
          <source>In Proceedings of the 16th International Workshop on Treebanks and Linguistic Theories (TLT16)</source>
          , pages
          <fpage>201</fpage>
          -
          <lpage>210</lpage>
          , Prague, Czech Republic, January.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <given-names>C.</given-names>
            <surname>Alzetta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Dell'Orletta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Montemagni</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Venturi</surname>
          </string-name>
          . 2018b.
          <article-title>Universal Dependencies and quantitative typological trends. A case study on word order</article-title>
          .
          <source>In Proceedings of the 11th Edition of International Conference on Language Resources and Evaluation (LREC 2018)</source>
          , pages
          <fpage>4540</fpage>
          -
          <lpage>4549</lpage>
          . Association for Computational Linguistics.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <given-names>C.</given-names>
            <surname>Alzetta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Dell'Orletta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Montemagni</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Venturi</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>Uncovering typological context-sensitive features</article-title>
          .
          <source>In Proceedings of the Second Workshop on Typology for Polyglot Natural Language Processing.</source>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <given-names>Johannes</given-names>
            <surname>Bjerva</surname>
          </string-name>
          and
          <string-name>
            <given-names>Isabelle</given-names>
            <surname>Augenstein</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>From phonology to syntax: Unsupervised linguistic typology at different levels with language embeddings</article-title>
          .
          <source>In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source>
          , Volume
          <volume>1</volume>
          (
          <issue>Long Papers)</issue>
          , pages
          <fpage>907</fpage>
          -
          <lpage>916</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <given-names>Johannes</given-names>
            <surname>Bjerva</surname>
          </string-name>
          , Yova Kementchedjhieva, Ryan Cotterell, and
          <string-name>
            <given-names>Isabelle</given-names>
            <surname>Augenstein</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>A probabilistic generative model of linguistic typology</article-title>
          .
          <source>In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source>
          , Volume
          <volume>1</volume>
          (Long and Short Papers), pages
          <fpage>1529</fpage>
          -
          <lpage>1540</lpage>
          , Minneapolis, Minnesota, June. Association for Computational Linguistics.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <given-names>C.</given-names>
            <surname>Bosco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Montemagni</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Simi</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>Converting Italian treebanks: Towards an Italian Stanford dependency treebank</article-title>
          .
          <source>In Proceedings of the ACL Linguistic Annotation Workshop &amp; Interoperability with Discourse</source>
          , Sofia, Bulgaria, August.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <given-names>Ryan</given-names>
            <surname>Cotterell</surname>
          </string-name>
          and
          <string-name>
            <given-names>Jason</given-names>
            <surname>Eisner</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Probabilistic typology: Deep generative models of vowel inventories</article-title>
          .
          <source>In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)</source>
          , pages
          <fpage>1182</fpage>
          -
          <lpage>1192</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <given-names>M.C.</given-names>
            <surname>de Marneffe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Grioni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kanerva</surname>
          </string-name>
          , and
          <string-name>
            <given-names>F.</given-names>
            <surname>Ginter</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Assessing the Annotation Consistency of the Universal Dependencies Corpora</article-title>
          .
          <source>In Proceedings of the 4th International Conference on Dependency Linguistics (Depling 2017)</source>
          , pages
          <fpage>108</fpage>
          -
          <lpage>115</lpage>
          , Pisa, Italy, September.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <given-names>F.</given-names>
            <surname>Dell'Orletta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Venturi</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Montemagni</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>Linguistically-driven selection of correct arcs for dependency parsing</article-title>
          .
          <source>Computación y Sistemas</source>
          ,
          <volume>2</volume>
          :
          <fpage>125</fpage>
          -
          <lpage>136</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <string-name>
            <given-names>M.</given-names>
            <surname>Dickinson</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Detection of Annotation Errors in Corpora</article-title>
          .
          <source>Language and Linguistics Compass</source>
          ,
          <volume>9</volume>
          (
          <issue>3</issue>
          ):
          <fpage>119</fpage>
          -
          <lpage>138</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <string-name>
            <given-names>Matthew S.</given-names>
            <surname>Dryer</surname>
          </string-name>
          and Martin Haspelmath, editors.
          <year>2013</year>
          .
          <source>WALS Online</source>
          . Max Planck Institute for Evolutionary Anthropology, Leipzig.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <string-name>
            <given-names>Roman</given-names>
            <surname>Jakobson</surname>
          </string-name>
          .
          <year>1973</year>
          .
          <article-title>Essais de linguistique générale, t. 2: Rapports internes et externes du langage</article-title>
          . Les Éditions de Minuit.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <string-name>
            <given-names>J.</given-names>
            <surname>Jiang</surname>
          </string-name>
          and
          <string-name>
            <given-names>H.</given-names>
            <surname>Liu</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Quantitative Analysis of Dependency Structures</article-title>
          . De Gruyter Mouton, Berlin, Boston.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          <string-name>
            <given-names>Yugo</given-names>
            <surname>Murawaki</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Diachrony-aware induction of binary latent representations from typological features</article-title>
          .
          <source>In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers)</source>
          , pages
          <fpage>451</fpage>
          -
          <lpage>461</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          <string-name>
            <given-names>J.</given-names>
            <surname>Nivre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Ž.</given-names>
            <surname>Agić</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Ahrenberg</surname>
          </string-name>
          , et al.
          <year>2017</year>
          .
          <article-title>Universal Dependencies 2.0</article-title>
          .
          <source>LINDAT/CLARIN digital library at the Institute of Formal and Applied Linguistics, Charles University</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          <string-name>
            <given-names>J.</given-names>
            <surname>Nivre</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Towards a universal grammar for natural language processing</article-title>
          .
          <source>In Computational Linguistics and Intelligent Text Processing - Proceedings of the 16th International Conference, CICLing 2015, Part I</source>
          , pages
          <fpage>3</fpage>
          -
          <lpage>16</lpage>
          , Cairo, Egypt, April.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          <string-name>
            <given-names>E.M.</given-names>
            <surname>Ponti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>O'Horan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Berzak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Vulić</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Reichart</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Poibeau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Shutova</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A.</given-names>
            <surname>Korhonen</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Modeling language variation and universals: A survey on typological linguistics for natural language processing</article-title>
          . arXiv preprint arXiv:1807.00914.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          <string-name>
            <given-names>N.</given-names>
            <surname>Silveira</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Dozat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.C.</given-names>
            <surname>de Marneffe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bowman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Connor</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Bauer</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.D.</given-names>
            <surname>Manning</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>A gold standard dependency corpus for English</article-title>
          .
          <source>In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC)</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          <string-name>
            <given-names>K.</given-names>
            <surname>Simov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Simov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Ganev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Ivanova</surname>
          </string-name>
          , and
          <string-name>
            <given-names>I.</given-names>
            <surname>Grigorov</surname>
          </string-name>
          .
          <year>2004</year>
          .
          <article-title>The CLaRK System: XML-based Corpora Development System for Rapid Prototyping</article-title>
          .
          <source>Proceedings of LREC 2004</source>
          , pages
          <fpage>235</fpage>
          -
          <lpage>238</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          <string-name>
            <given-names>K.</given-names>
            <surname>Simov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Osenova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Simov</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Kouylekov</surname>
          </string-name>
          .
          <year>2005</year>
          .
          <article-title>Design and Implementation of the Bulgarian HPSG-based Treebank</article-title>
          .
          <source>Journal of Research on Language and Computation. Special Issue</source>
          , pages
          <fpage>495</fpage>
          -
          <lpage>522</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          <string-name>
            <given-names>M.</given-names>
            <surname>Straka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hajič</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Straková</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>UDPipe: Trainable pipeline for processing CoNLL-U files performing tokenization, morphological analysis, POS tagging and parsing</article-title>
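          . In Proceedings of
        </mixed-citation>
        <mixed-citation>
          <string-name>
            <given-names>E.</given-names>
            <surname>Tusa</surname>
          </string-name>
          ,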
          <string-name>
            <given-names>F.</given-names>
            <surname>Dell'Orletta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Montemagni</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Venturi</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Dieci sfumature di marcatezza sintattica: verso una nozione computazionale di complessità</article-title>
          .
          <source>In Proceedings of the Third Italian Conference on Computational Linguistics (CLiC-it)</source>
          , pages
          <fpage>3</fpage>
          -
          <lpage>16</lpage>
          , Napoli, Italy, December.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>