Bootstrapping Enhanced
                            Universal Dependencies for Italian

                     Maria Simi                             Simonetta Montemagni
              Dipartimento di Informatica           Istituto di Linguistica Computazionale
                  Università di Pisa                          “A. Zampolli” - CNR
              Largo B. Pontecorvo 3, Pisa                      Via Moruzzi 1, Pisa
                simi@di.unipi.it                 simonetta.montemagni@ilc.cnr.it


Abstract

English. The paper presents an extension of the Italian Universal Dependencies Treebank with an “enhanced” representation level (e-IUDT), aimed at simplifying the information extraction process. The modules developed to semi-automatically build e-IUDT were delexicalized to perform cross-language enhancements: preliminary experiments in this direction led to promising results.

Italiano. L’articolo presenta l’estensione della Universal Dependencies Treebank italiana (e-IUDT) con un livello di rappresentazione arricchito (“enhanced”), finalizzato a rendere più efficiente ed efficace il processo di estrazione dell’informazione. I moduli sviluppati per la costruzione semi-automatica della risorsa sono stati delessicalizzati e utilizzati per il trattamento di diverse lingue: esperimenti preliminari in questa direzione mostrano risultati promettenti.

1    Introduction

The Universal Dependencies (UD) project, launched in 2015, aims at developing cross-linguistically consistent treebank annotation for many languages, with the goal of facilitating multilingual parser development, cross-lingual learning, and parsing research from a language typology perspective (Nivre et al., 2016). UD represents an open community effort with over 200 contributors producing more than 100 treebanks in over 60 languages.

Starting from the Stanford Dependencies project, from which Universal Dependencies (UD) originate, two syntactic representation options are made available, suited to different use cases (De Marneffe and Manning, 2008): the so-called “basic” representation, where a close parallelism to the source text is maintained (i.e. where each word of the original sentence is present as a node), and the so-called “collapsed and propagated” representation, which was conceived with a specific view to information extraction tasks. Within the current version of UD, the “collapsed and propagated” representation has evolved into the graph-based enhanced representation proposed by Schuster and Manning (2016).

Since UD version 2.2 (officially released in July 2018), “enhanced treebanks” have started to appear for a limited number of languages, i.e. English, Finnish, Russian, Polish, Dutch, Latvian. In order to foster the development of enhanced treebanks for other languages, transfer experiments exploiting existing treebanks are reported in the literature, following both rule-based (Schuster and Manning, 2016) and data-driven (Nyblom et al., 2013) approaches.

This paper describes the approach we used for developing and validating the enhanced version of the Italian UD Treebank and reports the first results of transfer experiments to English.

2    Enhanced dependencies

Enhanced dependencies were proposed as a way to simplify the process of information extraction. Enhancements, for the most part, result in additional links added to the dependency tree, motivated by inferences, which nevertheless remain anchored at the surface representation level. The result of enhancing a dependency tree is a graph, possibly with cycles, but not necessarily a supergraph (since some of the original arcs may be discarded).

The current UD guidelines are quite conservative, i.e. they suggest practically feasible enhancements only. Despite this, enhancements cannot always be achieved automatically, and the task is challenging enough to be interesting. According to the guidelines, enhanced graphs may contain some or all of the following enhancements, described with particular emphasis on Italian:
1. Added subject relations in control and raising constructions;
2. Shared heads and dependents in coordination;
3. Co-reference in relative clause constructions;
4. Modifier specialization by means of case markers;
5. Null nodes for elided predicates.

2.1 Added subject relations

In the case of control and raising constructions, the subject of the subordinated non-finite clause is added. Consider the following examples, with controlled and raised subjects marked in bold:
1) Subject control: La mamma ha promesso a Maria di comprare il pane ‘The mother promised Maria to buy the bread’
2) Object control: La mamma ha convinto Maria a comprare il pane ‘The mother convinced Maria to buy the bread’
3) Oblique control: La mamma ha chiesto a Maria di comprare il pane ‘The mother asked Maria to buy the bread’
4) Subject raising: La mamma sembra apprezzare il pane integrale ‘The mother seems to like whole bread’
Figure 1 shows the UD representation of sentence 3), where the added subject relation (marked as nsubj:xsubj) is represented as an “enhanced arc” (in blue).

   Figure 1. Enhanced representation of oblique control

Control and raising predicates are superficially very similar, with one main difference: whereas Raising predicates have a ‘non-thematic’ argument, all arguments of Control predicates are ‘thematic’. Such a distinction is neutralized in the enhanced UD representation. In both cases, however, the selection of the controlled/raised argument is lexically driven.

2.2 Sharing in coordination

Coordination is another major source of potential enhancements, as information shared among conjuncts is typically attached only to the first conjunct and could be propagated to the other conjuncts, where this is applicable. In propagating information, it is useful to distinguish two cases, according to whether dependents of the first conjunct are propagated or the head of the first conjunct is propagated instead. Figure 2 shows Italian examples for each case.

   a) The book store buys and sells used books.
   b) The book store sells books and magazines
   Figure 2. a) Dependents propagation b) Head propagation

2.3 Co-reference in relative clauses

In basic UD, relative pronouns are normally attached to the main predicate of the relative clause, typically as nominal subjects (nsubj) or direct objects (obj). In the corresponding enhanced graph, the relative pronoun is linked to its antecedent with the ref relation and its dependency to the head of the relative clause is transferred to the antecedent itself, as exemplified in Figure 3, where it can be observed that the resulting enhanced representation contains a cycle.

   The book that I read
   Figure 3. Relative clauses

2.4 Specialization of relations

Adding case information to the relation name of non-core dependents serves the purpose of disambiguating their semantic role. This information is expressed in terms of the preposition or the subordinating conjunction introducing non-core dependents. In particular: nmod and obl relation labels, respectively marking nominal and oblique modifiers introduced by prepositions, are augmented with language-specific case information; acl and advcl labels, corresponding respectively to noun modifying clauses and adverbial clauses, are augmented with the markers introducing them. A similar type of specialization also applies to the conj dependency label linking conjuncts in coordinated structures, which is specialized with respect to the conjunction type (e, o, oppure …), as identified by the lemma of the cc dependency (i.e. the relation between a
conjunct and a preceding coordinating conjunction).

   After having dinner he went home
   Figure 4. Adding case and mark information to labels

2.5 Null nodes for elided predicates

Special null nodes are added in clauses to stand for a predicate which is elided; other cases of ellipsis are not being dealt with in the current UD guidelines due to major difficulties in their reconstruction. This type of enhancement occurs when the basic (i.e. pre-enhancement) tree contains an orphan relation which in the enhanced graph is removed and replaced by the reconstructed explicit syntactic structure. A new null node is added in place of the missing predicate and dependencies are redirected. Figure 5 shows an example of predicate elision, along with the enhanced version which introduces a new node (labeled as E6.1) obtained as a copy of the token ‘chiamava’.

   In intimacy she was calling him captain and he [calling her] boss.
   Figure 5. Null nodes for elided predicates

This is the most problematic among the foreseen UD enhancements, for several reasons: correct insertion points are difficult to anticipate; phraseological verbs and verbs with clitics (either in pronominal form or with clitic complements, see example in Figure 5) would require copying a variable number of tokens (the verb and the object with a shift in gender in the case at hand), which are not always easy to identify; the appropriate syntactic role of the dependents of the added (i.e. recovered) predicate must be inferred by proper alignment with the dependents of the originally explicit predicate. Moreover, the proposed UD treatment requires a major change in the treebank format with the addition of new tokens with special labeling and numbering. Therefore, the introduction of null nodes calls for an ad hoc treatment and introduces a complexity in the processing of the treebank which is not fully justified if the aim is only to address the cases of predicate elision, given that this is a rare phenomenon in treebanks. Other cases of elision, such as subject elision, are much more meaningful for Italian.

2.6 Open issues

Besides the standard enhancements foreseen for UD illustrated above, we are currently evaluating cases that could be treated as such for Italian, and could possibly be relevant for other languages as well. These include:
• case information, which could also be added for some core relations such as ccomp. Consider as an example the following sentences: Non so se verrà domani ‘I don’t know whether (he) will come tomorrow’ vs Non so quando arriverà ‘I don’t know when (he) will arrive’. Without enhancing the ccomp relation, the semantics of the subordinated clause (conditional vs temporal) remains underspecified;
• null nodes for elided subjects: Italian is a pro-drop language and the omission of explicit subjects occurs quite frequently in actual language usage; according to Bates (1976), the pro-drop rate by adults is 70%. The addition of null nodes for subject ellipsis could significantly enhance the syntactic representation with a view to information extraction tasks.
The typology of representation enhancements could also be further extended to neutralize diathesis alternations, as proposed by Candito et al. (2017) for French. In what follows, we focus on the standard UD enhancements, excluding the treatment of predicate elision, for which more careful investigation and detailed guidelines are required.

        Table 1. Guessing step: additional annotations

 Annotation                 Meaning
 ExtraSubjOf=id             token id is head of a new arc to be added to current token
 RefOf=id                   token id is head of a new arc to be added to current token
 PropagateDepTo=id          token id is head of a new arc to be added to current token
 PropagateHeadWith=label    label is the string suggested to propagate or to specialize a relation
 CaseSpec=label             label is the string suggested to propagate or to specialize a relation
 MarkSpec=label             label is the string suggested to propagate or to specialize a relation
 CcSpec=label               label is the string suggested to propagate or to specialize a relation

3    Developing an enhanced UD gold treebank for Italian

UD enhanced representation cannot be generated through a completely automatic process: this is a task that entails a global vision of the tree to be completed and often requires additional linguistic knowledge concerning e.g. raising/control
properties and/or selectional preferences of predicates. To build the enhanced Italian UD Treebank (henceforth, e-IUDT), we followed a three-step approach, articulated as follows:
1. Guessing: by making use of heuristics, a script suggests target nodes whose representation might be enhanced, e.g. the best extra subject candidate(s) in raising/control constructions, or the heads/dependents to be propagated in coordinated constructions. During this step, additional annotations are produced in the representation of the tokens involved. For example, the annotation ExtraSubjOf = j added to token i is an indication that i is an additional subject headed by j. In other cases, the additional annotation indicates a label to be used for specializing a given relation or whether a conjunct should be propagated. Table 1 summarizes the additional annotations used;
2. Revising: the human annotator is called to validate the proposed changes, automatically generated during the previous step;
3. Enhancing: validated additional annotations are used to automatically generate the enhanced UD representation. Enhancements are not limited to retyping or addition of dependencies; in some cases, they involve the reshaping of the dependency graph, and for this reason an automatic transformation reduces the chances of occasional errors.

The heuristics behind the guessing step make use of lexical resources extracted from the corpus itself: this is the case, for example, of lexical information on raising/control properties of predicates, guiding the identification of extra-subject candidates.

Following the three-step strategy sketched above, we built a gold standard e-IUDT resource on top of the development data set of the Italian UD treebank (Release 2.2), consisting of 11,908 tokens. In Table 2, the first two columns (headed by “IT DEV (GOLD)”) summarize the enhancements contained in the developed resource, which involve 21.75% of the words. Most of them are represented by the specialization of modifiers and conjoining relations, immediately followed by head propagation, relative clauses and extra-subjects. Interestingly enough, it can be noticed that the distribution of enhancements remains quite similar across different subsets of the same language (e.g. the development vs test sets for Italian), whether manually revised (dev) or not (test), or for another language, English.

4     A language-independent rule-based UD enhancer

Different cross-lingual techniques have been developed for adding enhanced dependencies to existing UD treebanks, both rule-based (Schuster and Manning, 2016) and data-driven (Nyblom et al., 2013). The modularity of the approach proposed for e-IUDT construction created the prerequisites for reusing some of these components for implementing a UD enhancing module. In what follows, we report preliminary results achieved by transforming the heuristics of the Guessing module into language-independent ones. Instead of using language-specific lexical information on raising/control properties of verbs for identifying extra-subject candidates, following the general UD strategy we used the heuristic according to which the controlled / raised subject of the embedded clause follows the obliqueness hierarchy, i.e. it is the object of the next higher clause, if there is one, or else its subject. Such a strategy was extended to foresee also oblique complements as controlled / raised subjects. The output of the Guessing module is directly passed to the Enhancing component. In order to test the effectiveness and generality of the approach, we tested the rule-based language-independent enhancer on the Italian and English development sets, both available as gold datasets.
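As an illustration only, the delexicalized extra-subject heuristic and the hand-over from Guessing to Enhancing might be sketched as follows. Everything named here is hypothetical and not taken from the authors' scripts: the `Token` class and its fields, the use of `xcomp` to detect the embedded clause, and the candidate order (object, then oblique, then subject) are assumptions; the paper only states that the object is preferred over the subject and that obliques were added as candidates. The dependency tree for example 2) of Section 2.1 is hand-built for the sketch.

```python
from dataclasses import dataclass, field


@dataclass
class Token:
    id: int
    form: str
    head: int      # id of the head in the basic tree (0 = root)
    deprel: str    # basic relation label
    misc: dict = field(default_factory=dict)  # guessing-step annotations
    deps: set = field(default_factory=set)    # enhanced arcs: (head id, label)


# Assumed candidate order: object first, then oblique, then subject.
OBLIQUENESS = ("obj", "obl", "nsubj")


def guess_extra_subjects(tokens):
    """Guessing: mark the chosen candidate i with ExtraSubjOf=j."""
    for embedded in tokens:
        if embedded.deprel != "xcomp":  # assumption: xcomp marks the embedded clause
            continue
        siblings = [t for t in tokens
                    if t.head == embedded.head and t.id != embedded.id]
        for rel in OBLIQUENESS:
            candidates = [t for t in siblings if t.deprel == rel]
            if candidates:
                candidates[0].misc["ExtraSubjOf"] = embedded.id
                break


def enhance(tokens):
    """Enhancing: keep the basic arcs and add nsubj:xsubj arcs from annotations."""
    for t in tokens:
        t.deps.add((t.head, t.deprel))
        target = t.misc.get("ExtraSubjOf")
        if target is not None:
            t.deps.add((target, "nsubj:xsubj"))


# Object control: "La mamma ha convinto Maria a comprare il pane"
sentence = [
    Token(1, "La", 2, "det"),
    Token(2, "mamma", 4, "nsubj"),
    Token(3, "ha", 4, "aux"),
    Token(4, "convinto", 0, "root"),
    Token(5, "Maria", 4, "obj"),
    Token(6, "a", 7, "mark"),
    Token(7, "comprare", 4, "xcomp"),
    Token(8, "il", 9, "det"),
    Token(9, "pane", 7, "obj"),
]
guess_extra_subjects(sentence)
enhance(sentence)
print(sorted(sentence[4].deps))  # [(4, 'obj'), (7, 'nsubj:xsubj')]
```

In the three-step setting of Section 3, a human revision of the `misc` annotations would sit between the two calls; in the language-independent setting the two functions are simply chained.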
                                          Table 2. Enhanced relations

                          IT DEV (GOLD)     IT TEST (SILVER)   EN DEV (GOLD)     EN TEST (GOLD)
words                     11,908            10,417             25,150            17,658
enhancements               2,590  21.75%     2,275  21.84%      4,255  16.92%     3,595  20.36%
    xsubj                     69   2.66%        69   3.03%        342   8.04%       251   6.98%
    ref                      127   4.90%       210   9.23%        111   2.61%       274   7.62%
    conj specializations     322   12.4%       266   11.7%        810  19.03%       532  14.80%
    dep propagation*          45    1.7%        36    1.6%        165    3.9%       103   2.87%
    head propagation*        250    9.7%       230   10.1%        478   11.2%       413  11.49%
    other specializations  1,777   68.6%     1,464   64.4%      2,349     55%     2,022  56.24%

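The percentages in Table 2 appear to use two different denominators: the "enhancements" row is a share of the words, while each enhancement type is a share of the enhancements. A small check on the IT DEV (GOLD) counts, with a hypothetical helper `share`, is consistent with this reading:

```python
def share(part, whole):
    """Percentage of `part` in `whole`, rounded to two decimals."""
    return round(100 * part / whole, 2)


# Counts copied from the IT DEV (GOLD) column of Table 2.
words = 11908
enhancements = 2590
by_type = {
    "xsubj": 69,
    "ref": 127,
    "conj specializations": 322,
    "dep propagation": 45,
    "head propagation": 250,
    "other specializations": 1777,
}

# The per-type counts add up to the total number of enhancements.
assert sum(by_type.values()) == enhancements

print("enhancements / words:", share(enhancements, words))  # 21.75
for name, count in by_type.items():
    print(name, share(count, enhancements))
```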
For evaluation, we used an adaptation of the evaluation script used in the evaluation campaign EVALITA 2014 (Bosco et al., 2014), which is based on a set of relations extracted from the enhanced graph and for each of them computes Precision, Recall and F1. The evaluation focused on enhanced relations, thus making it possible to analyze the complexity of the task. Table 3 reports the results achieved with the following gold data sets: IT-dev, the development dataset from UD-ISDT 2.2, enhanced as described above; EN-dev and EN-test, the development and test English datasets from UD-EWT 2.2.

  Table 3. Precision, recall and F1 for enhanced relations

                     UAS                     LAS
                P      R      F1       P      R      F1
  IT-dev       99.7   99.8   99.8     99.5   99.6   99.6
  EN-dev       98.2   99.3   98.8     96.2   97.2   96.7
  EN-test      99.2   99.0   99.0     97.8   97.6   97.6

  Table 4. Recall and Precision for enhancement type

                  IT-dev           EN-dev           EN-test
                  R       P        R       P        R       P
  xsubj           92.7    98.4    100.0    99.4     99.6    99.0
  ref            100.0   100.0     99.1    86.6     99.3    94.4
  conj spec       99.7   100.0     98.2    94.9     97.9    97.6
  other specs     99.9   100.0     97.0    96.7     98.2    98.1
  propagation     97.8    95.7     97.1    97.3     95.5    98.2

For Italian, despite the de-lexicalization of the Guessing module, UAS and LAS results are quite high. Results are very high also when enhancement is carried out against different sets of the English UD Treebank. A qualitative error analysis was also performed. Table 4 details recall and precision achieved for the different types of enhancements, for both Italian and English.

The main sources of errors turned out to be:
• the identification of extra-subjects, performed on the basis of heuristics rather than lexical information. This is particularly true for Italian, for both P and R;
• the specialization of relations with case markers, which turned out to be particularly problematic for multi-word markers. This can be observed mainly for English, for which a different strategy is followed in their representation;
• dependent propagation in coordinated constructions, which is not always easy in either language. For Italian, the interference with pro-drop subjects should also be considered;
• other problematic cases include non-homogeneous conjuncts for which the propagation of dependents or heads cannot always be easily carried out.

An example follows where, without lexical information, the identification of extra subjects fails. Consider the sentence I carri armati … andavano a Budapest … a spegnere i fuochi ‘The tanks ... went to Budapest ... to extinguish the fires’. In UD, the obl relation covers both lexically realized indirect objects and other oblique complements: however, without distinguishing between the two it is impossible to recover the extra subject of the infinitive clause. A suggestion could be to introduce a specialization of the obl relation for identifying indirect objects.

Dependency specialization turned out to be a challenging conversion case when applied to the English UD treebank: the problems encountered were somewhat unexpected, being mostly due to a different strategy for annotating multi-word case markers, not always compliant with the general UD annotation guidelines. This explains the lower results reported in Table 3 for English with respect to Italian.

5   Conclusions

We extended the Italian UD Treebank with an enhanced representation level: Italian is now among the few languages within UD with a gold enhanced Treebank, which will be part of Release v2.3. The modules used to semi-automatically build e-IUDT were delexicalized to carry out cross-language enhancements: preliminary results for both Italian and English are promising. The contribution also includes better and more detailed specifications to the constantly in-progress guidelines. Current developments include: from a mono-lingual perspective, extension of the typology of enhancements; from the multi-lingual perspective, testing and extending the enhancement component successfully used with English for other languages.

References

Elizabeth Bates. 1976. Language and context: The acquisition of pragmatics. New York, NY: Academic Press.

Cristina Bosco, Vincenzo Lombardo, Leonardo Lesmo, Daniela Vassallo. 2000. Building a treebank for Italian: a data-driven annotation schema. In Proceedings of LREC 2000, Athens, Greece.

Cristina Bosco, Simonetta Montemagni, Maria Simi. 2012. Harmonization and Merging of two Italian
  Dependency Treebanks, Workshop on Merging of Language Resources, in Proceedings of LREC 2012, Workshop on Language Resource Merging, Istanbul, May 2012, ELRA, pp. 23–30.

Cristina Bosco, Simonetta Montemagni, Maria Simi. 2013. Converting Italian Treebanks: Towards an Italian Stanford Dependency Treebank. In: ACL Linguistic Annotation Workshop & Interoperability with Discourse, Sofia, Bulgaria.

Cristina Bosco, Felice Dell’Orletta, Simonetta Montemagni, Manuela Sanguinetti, Maria Simi. 2014. The Evalita 2014 Dependency Parsing task, CLiC-it 2014 and EVALITA 2014 Proceedings, Pisa University Press, ISBN/EAN: 978-886741-472-7, 1–8.

Marie Candito, Bruno Guillaume, Guy Perrier, Djamé Seddah. 2017. Enhanced UD Dependencies with Neutralized Diathesis Alternation, Depling 2017 - Fourth International Conference on Dependency Linguistics, Sep 2017, Pisa, Italy.

Marie-Catherine de Marneffe and Christopher D. Manning. 2008. The Stanford typed dependencies representation. In COLING Workshop on Cross-framework and Cross-domain Parser Evaluation.

Marie-Catherine de Marneffe, Miriam Connor, Natalia Silveira, Bowman S. R., Timothy Dozat, Christopher D. Manning. 2013. More constructions, more genres: Extending Stanford Dependencies, Proc. of the Second International Conference on Dependency Linguistics (DepLing 2013), Prague, August 27–30, Charles University in Prague, Matfyzpress, Prague, pp. 187–196.

Marie-Catherine de Marneffe and Christopher D. Manning. 2013. Stanford typed dependencies manual, September 2008, Revised for the Stanford Parser v. 3.3 in December 2013.

Marie-Catherine De Marneffe, Timothy Dozat, Natalia Silveira, Katri Haverinen, Filip Ginter, Joakim Nivre, Christopher D. Manning. 2014. Universal Stanford Dependencies: a Cross-Linguistic Typology. In: Proc. LREC 2014, Reykjavik, Iceland, ELRA.

Joakim Nivre, Marie-Catherine de Marneffe, Filip Ginter, Yoav Goldberg, Jan Hajič, Christopher D. Manning, Ryan McDonald, Slav Petrov, Sampo Pyysalo, Natalia Silveira, Reut Tsarfaty, Daniel Zeman. 2016. Universal Dependencies v1: A Multilingual Treebank Collection. In Proceedings of LREC.

Simonetta Montemagni, Maria Simi. 2007. The Italian dependency annotated corpus developed for the CoNLL–2007 shared task. Technical report, ILC–CNR.

Jenna Nyblom, Samuel Kohonen, Katri Haverinen, Tapio Salakoski and Filip Ginter. 2013. Predicting conjunct propagation and other extended Stanford Dependencies. Proceedings of the Second International Conference on Dependency Linguistics (DepLing 2013), pp. 252–261, Prague, August 27–30.

Maria Simi, Cristina Bosco, Simonetta Montemagni. 2008. Less is More? Towards a Reduced Inventory of Categories for Training a Parser for the Italian Stanford Dependencies. In: Proc. LREC 2014, 26–31, May, Reykjavik, Iceland, ELRA.

Sebastian Schuster and Christopher D. Manning. 2016. Enhanced English Universal Dependencies: An Improved Representation for Natural Language Understanding Tasks. LREC 2016.