Bootstrapping Enhanced Universal Dependencies for Italian

Maria Simi
Dipartimento di Informatica, Università di Pisa
Largo B. Pontecorvo 3, Pisa
simi@di.unipi.it

Simonetta Montemagni
Istituto di Linguistica Computazionale “A. Zampolli” - CNR
Via Moruzzi 1, Pisa
simonetta.montemagni@ilc.cnr.it

Abstract

English. The paper presents an extension of the Italian Universal Dependencies Treebank with an “enhanced” representation level (e-IUDT), aimed at simplifying the information extraction process. The modules developed to semi-automatically build e-IUDT were delexicalized to perform cross-language enhancements: preliminary experiments in this direction led to promising results.

Italiano (translated). The article presents the extension of the Italian Universal Dependencies Treebank (e-IUDT) with an enriched (“enhanced”) representation level, aimed at making the information extraction process more efficient and effective. The modules developed for the semi-automatic construction of the resource were delexicalized and used to process different languages: preliminary experiments in this direction show promising results.

1 Introduction

The Universal Dependencies (UD) project, launched in 2015, aims at developing cross-linguistically consistent treebank annotation for many languages, with the goal of facilitating multilingual parser development, cross-lingual learning, and parsing research from a language typology perspective (Nivre et al., 2016). UD represents an open community effort with over 200 contributors producing more than 100 treebanks in over 60 languages.

Starting from the Stanford Dependencies project, from which Universal Dependencies (UD) originate, two syntactic representation options are made available, suited to different use cases (De Marneffe and Manning, 2008): the so-called “basic” representation, where a close parallelism to the source text is maintained (i.e. where each word of the original sentence is present as a node), and the so-called “collapsed and propagated” representation, which was conceived with a specific view to information extraction tasks. Within the current version of UD, the “collapsed and propagated” representation has evolved into the graph-based enhanced representation proposed by Schuster and Manning (2016).

Since UD version 2.2 (officially released in July 2018), “enhanced treebanks” have started to appear for a limited number of languages, i.e. English, Finnish, Russian, Polish, Dutch, Latvian. In order to foster the development of enhanced treebanks for other languages, transfer experiments exploiting existing treebanks are reported in the literature, following both rule-based (Schuster and Manning, 2016) and data-driven (Nyblom et al., 2013) approaches.

This paper describes the approach we used for developing and validating the enhanced version of the Italian UD Treebank and reports the first results of transfer experiments to English.

2 Enhanced dependencies

Enhanced dependencies were proposed as a way to simplify the process of information extraction. Enhancements, for the most part, result in additional links added to the dependency tree, motivated by inferences, which however remain anchored at the surface representation level. The result of enhancing a dependency tree is a graph, possibly with cycles, but not necessarily a supergraph of the basic tree (since some of the original arcs may be discarded).

The current UD guidelines are quite conservative, i.e. they suggest practically feasible enhancements only. Despite this, enhancements cannot always be achieved automatically, and the task is challenging enough to be interesting. According to the guidelines, enhanced graphs may contain some or all of the following enhancements, described with particular emphasis on Italian:

1. Added subject relations in control and raising constructions;
2. Shared heads and dependents in coordination;
3. Co-reference in relative clause constructions;
4. Modifier specialization by means of case markers;
5. Null nodes for elided predicates.

2.1 Added subject relations

In control and raising constructions, a subject relation is added for the subordinate non-finite clause. Consider the following examples, with the controlled or raised subject indicated in parentheses:

1) Subject control: La mamma ha promesso a Maria di comprare il pane ‘The mother promised Maria to buy the bread’ (controlled subject: La mamma)
2) Object control: La mamma ha convinto Maria a comprare il pane ‘The mother convinced Maria to buy the bread’ (controlled subject: Maria)
3) Oblique control: La mamma ha chiesto a Maria di comprare il pane ‘The mother asked Maria to buy the bread’ (controlled subject: Maria)
4) Subject raising: La mamma sembra apprezzare il pane integrale ‘The mother seems to like wholemeal bread’ (raised subject: La mamma)

Figure 1 shows the UD representation of sentence 3), where the added subject relation (marked as nsubj:xsubj) is represented as an “enhanced arc” (in blue).

Figure 1. Enhanced representation of oblique control

Control and raising predicates are superficially very similar, with one main difference: whereas raising predicates have a ‘non-thematic’ argument, all arguments of control predicates are ‘thematic’. This distinction is neutralized in the enhanced UD representation. In both cases, however, the selection of the controlled/raised argument is lexically driven.

2.2 Sharing in coordination

Coordination is another major source of potential enhancements, as information shared among conjuncts is typically attached only to the first conjunct and could be propagated to the other conjuncts, where this is applicable. In propagating information, it is useful to distinguish two cases, according to whether dependents of the first conjunct are propagated or the head of the first conjunct is propagated instead. Figure 2 shows Italian examples for each case.

Figure 2. a) Dependents propagation (gloss: ‘The book store buys and sells used books.’); b) Head propagation (gloss: ‘The book store sells books and magazines.’)

2.3 Co-reference in relative clauses

In basic UD, relative pronouns are normally attached to the main predicate of the relative clause, typically as nominal subjects (nsubj) or direct objects (obj). In the corresponding enhanced graph, the relative pronoun is linked to its antecedent with the ref relation, and its dependency on the head of the relative clause is transferred to the antecedent itself, as exemplified in Figure 3, where it can be observed that the resulting enhanced representation contains a cycle.

Figure 3. Relative clauses (gloss: ‘The book that I read’)

2.4 Specialization of relations

Adding case information to the relation name of non-core dependents serves the purpose of disambiguating their semantic role. This information is expressed in terms of the preposition or subordinating conjunction introducing non-core dependents. In particular: nmod and obl relation labels, respectively marking nominal and oblique modifiers introduced by prepositions, are augmented with language-specific case information; acl and advcl labels, corresponding respectively to noun-modifying clauses and adverbial clauses, are augmented with the markers introducing them. A similar type of specialization also applies to the conj dependency label linking conjuncts in coordinated structures, which is specialized with respect to the conjunction type (e, o, oppure …), as identified by the lemma of the cc dependency (i.e. the relation between a conjunct and a preceding coordinating conjunction).

Figure 4. Adding case and mark information to labels (gloss: ‘After having dinner he went home’)

2.5 Null nodes for elided predicates

Special null nodes are added in clauses to stand for a predicate which is elided; other cases of ellipsis are not dealt with in the current UD guidelines due to major difficulties in their reconstruction. This type of enhancement occurs when the basic (i.e. pre-enhancement) tree contains an orphan relation, which in the enhanced graph is removed and replaced by the reconstructed explicit syntactic structure. A new null node is added in place of the missing predicate and dependencies are redirected. Figure 5 shows an example of predicate elision, along with the enhanced version, which introduces a new node (labeled as E6.1) obtained as a copy of the token ‘chiamava’.

Figure 5. Null nodes for elided predicates (gloss: ‘In intimacy she was calling him captain and he [calling her] boss.’)

This is the most problematic among the foreseen UD enhancements, for several reasons: correct insertion points are difficult to anticipate; phraseological verbs and verbs with clitics (either in pronominal form or with clitic complements, see the example in Figure 5) would require copying a variable number of tokens (the verb and the object with a shift in gender in the case at hand), which are not always easy to identify; the appropriate syntactic role of the dependents of the added (i.e. recovered) predicate must be inferred by proper alignment with the dependents of the originally explicit predicate. Moreover, the proposed UD treatment requires a major change in the treebank format, with the addition of new tokens with special labeling and numbering. Therefore, the introduction of null nodes calls for an ad hoc treatment and introduces a complexity in the processing of the treebank which is not fully justified if the aim is only to address the cases of predicate elision, since this is a rare phenomenon in treebanks. Other cases of elision, such as subject elision, are much more meaningful for Italian.

2.6 Open issues

Besides the standard enhancements foreseen for UD illustrated above, we are currently evaluating cases that could be treated as such for Italian, and could possibly be relevant for other languages as well. These include:

• case information, which could also be added for some core relations such as ccomp. Consider as an example the following sentences: Non so se verrà domani ‘I don’t know whether (he) will come tomorrow’ vs Non so quando arriverà ‘I don’t know when (he) will arrive’. Without enhancing the ccomp relation, the semantics of the subordinated clause (conditional vs temporal) remains underspecified;
• null nodes for elided subjects: Italian is a pro-drop language and the omission of explicit subjects occurs quite frequently in actual language usage; according to Bates (1976), the pro-drop rate by adults is 70%. The addition of null nodes for subject ellipsis could significantly enhance the syntactic representation with a view to information extraction tasks.

The typology of representation enhancements could also be further extended to neutralize diathesis alternations, as proposed by Candito et al. (2017) for French. In what follows, we focus on the standard UD enhancements, excluding the treatment of predicate elision, for which more careful investigation and detailed guidelines are required.

3 Developing an enhanced UD gold treebank for Italian

The UD enhanced representation cannot be generated through a completely automatic process: this is a task that entails a global vision of the tree to be completed and often requires additional linguistic knowledge concerning e.g. raising/control properties and/or selectional preferences of predicates. To build the enhanced Italian UD Treebank (henceforth, e-IUDT), we followed a three-step approach, articulated as follows:

1. Guessing: by making use of heuristics, a script suggests target nodes whose representation might be enhanced, e.g. the best extra subject candidate(s) in raising/control constructions, or the heads/dependents to be propagated in coordinated constructions. During this step, additional annotations are produced in the representation of the tokens involved. For example, the annotation ExtraSubjOf = j added to token i is an indication that i is an additional subject headed by j. In other cases, the additional annotation indicates a label to be used for specializing a given relation or whether a conjunct should be propagated. Table 1 summarizes the additional annotations used;
2. Revising: the human annotator is called to validate the proposed changes, automatically generated during the previous step;
3. Enhancing: validated additional annotations are used to automatically generate the enhanced UD representation. Enhancements are not limited to retyping or addition of dependencies; in some cases, they involve the reshaping of the dependency graph, and for this reason an automatic transformation reduces the chances of occasional errors.

Table 1. Guessing step: additional annotations

Annotation                 Meaning
ExtraSubjOf=id             token id is head of a new arc to be added to current token
RefOf=id                   token id is head of a new arc to be added to current token
PropagateDepTo=id          token id is head of a new arc to be added to current token
PropagateHeadWith=label    label is the string suggested to propagate or to specialize a relation
CaseSpec=label             label is the string suggested to propagate or to specialize a relation
MarkSpec=label             label is the string suggested to propagate or to specialize a relation
CcSpec=label               label is the string suggested to propagate or to specialize a relation

The heuristics behind the guessing step make use of lexical resources extracted from the corpus itself: this is the case, for example, of lexical information on raising/control properties of predicates, guiding the identification of extra-subject candidates.

Following the three-step strategy sketched above, we built a gold standard e-IUDT resource on top of the development data set of the Italian UD treebank (Release 2.2), consisting of 11,908 tokens. In Table 2, the first two columns (headed by “IT DEV (GOLD)”) summarize the enhancements contained in the developed resource, which involve 21.75% of the words. Most of them are represented by the specialization of modifiers and conjoining relations, immediately followed by head propagation, relative clauses and extra-subjects. Interestingly enough, the distribution of enhancements remains quite similar across different subsets of the same language (e.g. the development vs test sets for Italian), whether manually revised (dev) or not (test), and for another language, English.

4 A language-independent rule-based UD enhancer

Different cross-lingual techniques have been developed for adding enhanced dependencies to existing UD treebanks, both rule-based (Schuster and Manning, 2016) and data-driven (Nyblom et al., 2013). The modularity of the approach proposed for e-IUDT construction created the prerequisites for reusing some of these components for implementing a UD enhancing module. In what follows, we report preliminary results achieved by transforming the heuristics of the Guessing module into language-independent ones. Instead of using language-specific lexical information on raising/control properties of verbs for identifying extra-subject candidates, following the general UD strategy we used the heuristic according to which the controlled/raised subject of the embedded clause follows the obliqueness hierarchy, i.e. it is the object of the next higher clause, if there is one, or else its subject. This strategy was extended to also allow oblique complements as controlled/raised subjects. The output of the Guessing module is directly passed to the Enhancing component. In order to test the effectiveness and generality of the approach, we tested the rule-based language-independent enhancer on the Italian and English development sets, both available as gold datasets.
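As a hedged illustration of the relation specialization described in Section 2.4, the sketch below augments nmod/obl labels with the lemma of their case dependent, acl/advcl labels with the lemma of their mark dependent, and conj labels with the lemma of their cc dependent. The Token class, the specialize function and the use of an underscore to join several markers are assumptions made for this example; they are not taken from the e-IUDT modules.

```python
from dataclasses import dataclass


@dataclass
class Token:
    id: int       # 1-based position in the sentence
    lemma: str
    head: int     # id of the syntactic head, 0 for the root
    deprel: str   # basic UD relation label


# Relation to specialize -> relation of the function word supplying the marker.
MARKER_SOURCE = {"nmod": "case", "obl": "case",
                 "acl": "mark", "advcl": "mark", "conj": "cc"}


def specialize(tokens):
    """Return {token id: enhanced label}, e.g. obl -> obl:a, conj -> conj:e."""
    enhanced = {}
    for tok in tokens:
        label = tok.deprel
        source = MARKER_SOURCE.get(tok.deprel)
        if source is not None:
            # Lemmas of the function words attached to this token via `source`.
            markers = [t.lemma.lower() for t in tokens
                       if t.head == tok.id and t.deprel == source]
            if markers:
                # Joining several markers with "_" is an assumption of this sketch.
                label = f"{tok.deprel}:{'_'.join(markers)}"
        enhanced[tok.id] = label
    return enhanced


# Hand-built toy analysis of "andavano a Budapest".
oblique = [Token(1, "andare", 0, "root"),
           Token(2, "a", 3, "case"),
           Token(3, "Budapest", 1, "obl")]
print(specialize(oblique))  # {1: 'root', 2: 'case', 3: 'obl:a'}
```

Multi-word case markers, which the paper reports as the main difficulty for English, would need a treatment that goes beyond this simple lookup.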
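The delexicalized extra-subject heuristic of Section 4 can be sketched as follows. This is a minimal illustration, not the authors' Guessing module: the Token class, the guess_extra_subject function and in particular the ranking obj > obl > nsubj are assumptions (the paper states that oblique complements were added as candidates, but not where they rank), and the toy analyses are hand-built.

```python
from dataclasses import dataclass


@dataclass
class Token:
    id: int
    form: str
    head: int     # id of the syntactic head, 0 for the root
    deprel: str   # basic UD relation label


# Assumed ranking of candidate controllers in the governing clause.
PRIORITY = ("obj", "obl", "nsubj")


def guess_extra_subject(tokens, xcomp_id):
    """Return the id of the guessed controlled/raised subject, or None."""
    by_id = {t.id: t for t in tokens}
    embedded = by_id[xcomp_id]
    if embedded.deprel != "xcomp":
        return None
    governor = embedded.head
    # Take the first dependent of the governing predicate found in PRIORITY order.
    for rel in PRIORITY:
        for t in tokens:
            if t.head == governor and t.deprel == rel:
                return t.id
    return None


# "La mamma ha convinto Maria a comprare il pane" (object control)
object_control = [
    Token(1, "La", 2, "det"), Token(2, "mamma", 4, "nsubj"),
    Token(3, "ha", 4, "aux"), Token(4, "convinto", 0, "root"),
    Token(5, "Maria", 4, "obj"), Token(6, "a", 7, "mark"),
    Token(7, "comprare", 4, "xcomp"), Token(8, "il", 9, "det"),
    Token(9, "pane", 7, "obj"),
]
# "La mamma sembra apprezzare il pane integrale" (subject raising)
raising = [
    Token(1, "La", 2, "det"), Token(2, "mamma", 3, "nsubj"),
    Token(3, "sembra", 0, "root"), Token(4, "apprezzare", 3, "xcomp"),
    Token(5, "il", 6, "det"), Token(6, "pane", 4, "obj"),
    Token(7, "integrale", 6, "amod"),
]
print(guess_extra_subject(object_control, 7))  # 5 (Maria)
print(guess_extra_subject(raising, 4))         # 2 (mamma)
```

Under the assumed ranking, any obl dependent of the governing verb is preferred over its subject, so a purely structural rule of this kind cannot tell an indirect object from other oblique complements without lexical information.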
Table 2. Enhanced relations

                        IT DEV (GOLD)    IT TEST (SILVER)   EN DEV (GOLD)    EN TEST (GOLD)
words                   11,908           10,417             25,150           17,658
enhancements            2,590   21.75%   2,275   21.84%     4,255   16.92%   3,595   20.36%
xsubj                      69    2.66%      69    3.03%       342    8.04%     251    6.98%
ref                       127    4.90%     210    9.23%       111    2.61%     274    7.62%
conj specializations      322    12.4%     266    11.7%       810   19.03%     532   14.80%
dep propagation*           45     1.7%      36     1.6%       165     3.9%     103    2.87%
head propagation*         250     9.7%     230    10.1%       478    11.2%     413   11.49%
other specializations   1,777    68.6%   1,464    64.4%     2,349      55%   2,022   56.24%

For evaluation, we used an adaptation of the evaluation script used in the evaluation campaign EVALITA 2014 (Bosco et al., 2014), which is based on a set of relations extracted from the enhanced graph and for each of them computes Precision, Recall and F1. The evaluation focused on enhanced relations, thus making it possible to analyze the complexity of the task. Table 3 reports the results achieved with the following gold data sets: IT-dev, the development dataset from UD-ISDT 2.2, enhanced as described above; EN-dev and EN-test, the development and test English datasets from UD-EWT 2.2.

Table 3. Precision, recall and F1 for enhanced relations

            UAS                    LAS
            P      R      F1       P      R      F1
IT-dev      99.7   99.8   99.8     99.5   99.6   99.6
EN-dev      98.2   99.3   98.8     96.2   97.2   96.7
EN-test     99.2   99.0   99.0     97.8   97.6   97.6

Table 4. Recall and Precision for enhancement type

               IT-dev            EN-dev            EN-test
               R       P         R       P         R       P
xsubj          92.7    98.4      100.0   99.4      99.6    99.0
ref            100.0   100.0     99.1    86.6      99.3    94.4
conj spec      99.7    100.0     98.2    94.9      97.9    97.6
other specs    99.9    100.0     97.0    96.7      98.2    98.1
propagation    97.8    95.7      97.1    97.3      95.5    98.2

For Italian, despite the de-lexicalization of the Guessing module, UAS and LAS results are quite high. Results are also very high when enhancement is carried out against different sets of the English UD Treebank. A qualitative error analysis was also performed. Table 4 details recall and precision achieved for the different types of enhancements, for both Italian and English. The main sources of errors turned out to be:

• the identification of extra-subjects, performed on the basis of heuristics rather than lexical information. This is particularly true for Italian, for both P and R;
• the specialization of relations with case markers, which turned out to be particularly problematic for multi-word markers. This can be observed mainly for English, for which a different strategy is followed in their representation;
• dependent propagation in coordinated constructions, which is not always easy for both languages. For Italian, the interference with pro-drop subjects should also be considered;
• other problematic cases include non-homogeneous conjuncts, for which the propagation of dependents or heads cannot always be easily carried out.

An example follows where, without lexical information, the identification of extra subjects fails. Consider the sentence I carri armati … andavano a Budapest … a spegnere i fuochi ‘The tanks ... went to Budapest ... to extinguish the fires’. In UD, the obl relation covers both lexically realized indirect objects and other oblique complements: however, without distinguishing between the two it is impossible to recover the extra subject of the infinitive clause. A suggestion could be to introduce a specialization of the obl relation for identifying indirect objects.

Dependency specialization turned out to be a challenging conversion case when applied to the English UD treebank: the problems encountered were somewhat unexpected, being mostly due to a different strategy for annotating multi-word case markers, not always compliant with the general UD annotation guidelines. This explains the lower results reported in Table 3 for English with respect to Italian.

5 Conclusions

We extended the Italian UD Treebank with an enhanced representation level: Italian is now among the few languages within UD with a gold enhanced treebank, which will be part of Release v2.3. The modules used to semi-automatically build e-IUDT were delexicalized to carry out cross-language enhancements: preliminary results for both Italian and English are promising. The contribution also includes better and more detailed specifications for the constantly in-progress guidelines. Current developments include: from a mono-lingual perspective, extension of the typology of enhancements; from the multi-lingual perspective, testing on other languages, and extending, the enhancement component successfully used with English.

References

Elizabeth Bates. 1976. Language and context: The acquisition of pragmatics. New York, NY: Academic Press.

Cristina Bosco, Vincenzo Lombardo, Leonardo Lesmo, Daniela Vassallo. 2000. Building a treebank for Italian: a data-driven annotation schema. In Proceedings of LREC 2000, Athens, Greece.

Cristina Bosco, Simonetta Montemagni, Maria Simi. 2012. Harmonization and Merging of two Italian Dependency Treebanks. In Proceedings of LREC 2012, Workshop on Language Resource Merging, Istanbul, May 2012, ELRA, pp. 23–30.

Cristina Bosco, Simonetta Montemagni, Maria Simi. 2013. Converting Italian Treebanks: Towards an Italian Stanford Dependency Treebank. In ACL Linguistic Annotation Workshop & Interoperability with Discourse, Sofia, Bulgaria.

Cristina Bosco, Felice Dell’Orletta, Simonetta Montemagni, Manuela Sanguinetti, Maria Simi. 2014. The Evalita 2014 Dependency Parsing task. In CLiC-it 2014 and EVALITA 2014 Proceedings, Pisa University Press, ISBN/EAN: 978-886741-472-7, pp. 1–8.

Marie Candito, Bruno Guillaume, Guy Perrier, Djamé Seddah. 2017. Enhanced UD Dependencies with Neutralized Diathesis Alternation. In Depling 2017 - Fourth International Conference on Dependency Linguistics, Sep 2017, Pisa, Italy.

Marie-Catherine de Marneffe and Christopher D. Manning. 2008. The Stanford typed dependencies representation. In COLING Workshop on Cross-framework and Cross-domain Parser Evaluation.

Marie-Catherine de Marneffe, Miriam Connor, Natalia Silveira, Samuel R. Bowman, Timothy Dozat, Christopher D. Manning. 2013. More constructions, more genres: Extending Stanford Dependencies. In Proceedings of the Second International Conference on Dependency Linguistics (DepLing 2013), Prague, August 27–30, Charles University in Prague, Matfyzpress, Prague, pp. 187–196.

Marie-Catherine de Marneffe and Christopher D. Manning. 2013. Stanford typed dependencies manual, September 2008, Revised for the Stanford Parser v. 3.3 in December 2013.

Marie-Catherine De Marneffe, Timothy Dozat, Natalia Silveira, Katri Haverinen, Filip Ginter, Joakim Nivre, Christopher D. Manning. 2014. Universal Stanford Dependencies: a Cross-Linguistic Typology. In Proceedings of LREC 2014, Reykjavik, Iceland, ELRA.

Joakim Nivre, Marie-Catherine de Marneffe, Filip Ginter, Yoav Goldberg, Jan Hajič, Christopher D. Manning, Ryan McDonald, Slav Petrov, Sampo Pyysalo, Natalia Silveira, Reut Tsarfaty, Daniel Zeman. 2016. Universal Dependencies v1: A Multilingual Treebank Collection. In Proceedings of LREC.

Simonetta Montemagni, Maria Simi. 2007. The Italian dependency annotated corpus developed for the CoNLL–2007 shared task. Technical report, ILC–CNR.

Jenna Nyblom, Samuel Kohonen, Katri Haverinen, Tapio Salakoski and Filip Ginter. 2013. Predicting conjunct propagation and other extended Stanford Dependencies. In Proceedings of the Second International Conference on Dependency Linguistics (DepLing 2013), pp. 252–261, Prague, August 27–30.

Maria Simi, Cristina Bosco, Simonetta Montemagni. 2014. Less is More? Towards a Reduced Inventory of Categories for Training a Parser for the Italian Stanford Dependencies. In Proceedings of LREC 2014, 26–31 May, Reykjavik, Iceland, ELRA.

Sebastian Schuster and Christopher D. Manning. 2016. Enhanced English Universal Dependencies: An Improved Representation for Natural Language Understanding Tasks. In Proceedings of LREC 2016.
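The evaluation described in Section 4 compares sets of relations extracted from the enhanced graphs and reports Precision, Recall and F1. The sketch below shows one way such a comparison could be computed, in a labeled (LAS-like) and an unlabeled (UAS-like) variant. It is an assumption-laden illustration, not the adapted EVALITA 2014 script: the (head, dependent, label) triple encoding and the function names prf and unlabeled are hypothetical.

```python
def prf(gold, system):
    """Precision, recall and F1 of a system relation set against a gold set."""
    correct = len(gold & system)
    precision = correct / len(system) if system else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def unlabeled(relations):
    """Drop the label from (head, dependent, label) triples."""
    return {(head, dep) for head, dep, _ in relations}


# Toy enhanced relations for one sentence, encoded as (head, dependent, label).
gold = {(4, 5, "obj"), (7, 5, "nsubj:xsubj"), (7, 9, "obj")}
system = {(4, 5, "obj"), (7, 5, "nsubj"), (7, 9, "obj")}

labeled_scores = prf(gold, system)                          # 2 of 3 triples match
unlabeled_scores = prf(unlabeled(gold), unlabeled(system))  # all 3 arcs match
print(labeled_scores, unlabeled_scores)
```

In this toy case the system attaches the extra subject to the right head but with the wrong label, so the unlabeled scores are perfect while the labeled ones drop to two thirds.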