PARSEME-IT: Issues in verbal Multiword Expressions identification and classification

Johanna Monti1, Valeria Caruso1, Maria Pia di Buono2
1 Dep. of Literary, Linguistic and Comparative Studies, “L’Orientale” University of Naples, Italy
2 TakeLab - Faculty of Electrical Engineering and Computing - University of Zagreb, Croatia
jmonti@unior.it, vcaruso@unior.it, mariapia.dibuono@fer.hr

Abstract

The second edition of the PARSEME shared task was based on new guidelines and methodologies that particularly concerned the Italian language, with the introduction of new categories of verbs not considered in the previous edition. This contribution presents the novelties introduced, the results obtained and the problems that emerged during the annotation process concerning some categories of verbs.

1 Introduction

The paper reports on some final results of the second edition of an annotation trial for verbal Multiword Expressions (VMWEs) carried out on the Italian language by the PARSEME-IT research group,1 which started within the broader European PARSEME project, the IC1207 COST action that ended in April 2017.2 The initial project is expanding in this second stage of its development, thanks to a wider network of research groups working together as a section of the ACL Special Interest Group on the Lexicon, called SIGLEX-MWE.

1 https://sites.google.com/view/parseme-it/home
2 https://typo.uni-konstanz.de/parseme/

In its first edition, the PARSEME shared task released a corpus of 5.5 million tokens and 60,000 VMWE annotations in 18 different languages, which is distributed under different versions of the Creative Commons license. To increase the computational efficiency of Natural Language Processing (NLP) applications, PARSEME focuses on a special class of Multiword Expressions which have seldom been modelled because of their challenging nature, namely verbal MWEs (Savary et al., 2017). Many of the features of this particular type of MWE are considered difficult to cope with, such as the discontinuity they present (turn it off), the syntactic variations they license (the decision was hard to take), the semantic variability resulting in both literal and idiomatic readings (to take the cake), or the syntactic ambiguity of many forms (on is a preposition in to trust on somebody, but a particle in to take on the task). Moreover, these units have language-specific features, and are generally modelled according to descriptive categories developed by different traditions of linguistic studies. The PARSEME research group thus also addresses the creation of a multilingual common platform for VMWEs using universal terminology, guidelines and methodologies for the identification of these units cross-linguistically. Moreover, at the end of the first annotation trial a shared task on automatic identification of VMWEs was also carried out and has proved the reliability and usefulness of the data collected so far, which have already been presented and discussed (Savary et al., 2017; Monti et al., 2017).

The paper illustrates the types of VMWEs used by the second PARSEME annotation trial more thoroughly. In Section 2 we provide a brief description of the second annotation trial of the PARSEME shared task together with the statistics. Then we present a new category of verbal MWEs, namely Inherently Clitic Verbs (Section 3), and in Section 4 two very productive categories in Italian (IRV and VID). In Section 5, we discuss some borderline cases which posed some classification issues. Finally, we conclude and discuss future work.

2 PARSEME Shared Task Second annotation trial: a brief report

This section focuses on the novelties which have been introduced in the guidelines and methodologies used for the second annotation trial in order to cover a wider range of VMWEs which were left aside in the first stage of the project. The improvements seem to be particularly valuable for the data collection carried out on the Italian language, because they address some peculiarities of Italian which were not considered in the first edition of the shared task but have been taken into account in the second edition, namely:

• Inherently clitic verbs (ICV), which is an extremely rich and varied VMWE category in Italian (Masini, 2015). As described in Section 3, a language specific category was created for the Italian language (LS.ICV) which takes into account only those verbs whose semantics is changed by a non-reflexive clitic pronoun, like entrarci when it means to be relevant to something, while the intransitive form of the verb entrare means to enter.

• Inherently adpositional verbs (IAV), a high frequency category of VMWEs, namely those verbs whose meanings are significantly affected by an “idiomatic selected preposition”, like su in contare su qualcuno (to rely on someone): without the preposition the verb means only to determine the total number of something. These verbs are often called prepositional verbs.3

• Multi verb constructions (MVC), VMWEs composed of a sequence of two adjacent verbs (in a language-dependent order), a governing verb V gov (also called a vector verb) and a dependent verb V dep (also called a polar verb), like in lasciar perdere (to give up).

3 Schneider, N., Green, M., 2015, New Guidelines for Annotating Prepositional Verbs, https://github.com/nschneid/nanni/wiki/Prepositional-Verb-Annotation-Guidelines

The other classifying categories used are (a) light verb constructions (LVCs), e.g., fare una passeggiata (to have a walk), and (b) idioms (ID), e.g., tirare le cuoia (to kick the bucket), considered to be universal categories, that is, categories which can be found in all languages participating in the task.

Other VMWEs are instead maintained as quasi-universal categories, since their range of application seems to cover only some language groups or languages, but not all. They are (c) inherently reflexive verbs (IReflVs), and (d) verb-particle constructions (VPCs). The first group (IReflVs) allows annotators to account for verbs which are never used without a reflexive clitic pronoun, e.g., (Italian) suicidarsi (to commit suicide), or for those verbs whose meaning is significantly affected by the pronoun, e.g., (Italian) farsi (to take drugs), while the non-pronominal form, fare, means to make. Semantic aspects are also used to identify verb-particle constructions (VPC), because their meaning is fully non-compositional, e.g., buttare giù (to swallow), or only partly non-compositional, like in tirare avanti (to go on), since the preposition no longer retains its spatial meaning.

Table 1 presents the statistics of the various categories of VMWEs in the PARSEME-IT corpus 1.1.

          sent.   tokens   VMWEs  IAV  IRV  LS.ICV  LVC.cause/full  MVC  VID   VPC.full/semi
IT-dev    917     32613    500    44   106  9       19/100          6    197   17/2
IT-train  13555   360883   3254   414  942  20      147/544         23   1098  66/0

Table 1: PARSEME-IT corpus version 1.1

3 A language specific category: Inherently clitic verbs (LS.ICV)

Inherently Clitic Verbs (LS.ICV) represent a specific category for some Romance languages, and they are particularly frequent in the Italian language. It is often challenging to distinguish LS.ICV from Inherently Reflexive Verbs (IRV), particularly because some clitics may be ambiguous, like se/si, which is a polyfunctional clitic pronoun and grammatical marker (and can have a reflexive, reciprocal, impersonal, passivizing, aspectual, and middle function). LS.ICVs together with IRVs are pronominal verbs. LS.ICVs are formed by a full verb combined with one or more non-reflexive clitics (CLI) that represent the pronominalization of one or more complements.

The following verbs should be annotated as LS.ICV:

• The verb without the CLI does not exist, e.g., infischiarsene (do not worry about) vs *infischiare;

• The verb without the CLI does exist, but has a very different meaning as in prenderle (gl.: to take them, transl. to be beaten) vs prendere (to take) or prenderci (gl.: to take it, transl. to grasp the truth) vs prendere (to take);

• The verb has more than one CLI of which the second one is an invariable object complement, like in fregarsene (gl.: matter self of-it, transl. do not care about) or infischiarsene (do not worry about);

• The verb has two non-reflexive invariable CLIs, like in farcela (gl.: to make there it, transl. to succeed);

• The verb has a different meaning with respect to an intensive use of the same two non-reflexive invariable CLIs, like in andarsene (gl.: to go away self from-there, transl. to die) vs andarsene (to go away) or bersela (gl.: drink self it, transl. to believe) vs bersela (to drink it).

The annotation of LS.ICV was performed following a specific decision tree.4 In the training corpus 20 different LS.ICV were annotated manually, such as farcela, rimetterci, fregarsene among others.

4 http://parsemefr.lif.univ-mrs.fr/parseme-st-guidelines/1.1/?page=060_Language-specific_tests/015_Inherently_clitic_verbs__LB_LS.ICV_RB_

4 Very productive VMWEs: IRVs and VID

IRVs and VID represent very productive categories in Italian which pose some classification issues due to their specific characteristics.

With reference to IRVs, the presence of the clitic pronoun si may generate ambiguity in the annotation process, as in Italian it refers to three different types of construction: i) reflexive, ii) impersonal, iii) inherent.

In order to distinguish these cases, we consider that in the reflexive construction the clitic pronoun can be paraphrased by means of either an anaphoric expression which stands for se stesso (oneself) or a mutual expression which refers to gli uni e gli altri (these and those). Another relevant aspect to consider in the classification of IRVs is the presence of an implicit thematic role, due to the fact that the action includes two different entities with different thematic properties but with the same reference, e.g., in guardarsi (to look at oneself) the clitic signals the presence of coreference between the first argument and the second one. Another source of misclassification of IRVs is related to the presence of anticausative constructions. In these constructions, the clitic may represent an overt marker of reduced transitivity, e.g., sedersi (to sit down).

In some cases, IRVs occur in idiomatic constructions and their meaning is affected by the presence of new elements, such as in guardarsi bene da (to be careful not to). Consequently the annotation of such occurrences is subject to the evaluation of characteristics related to VID, such as the low variability, the presence of semantically non-compositional meaning, and the literal-idiomatic ambiguity. In the VID class, the non-compositionality property is prototypical, such as in battersi all’ultimo sangue (lit. to fight till the last blood), which means to fight to the last. Although their meaning is opaque, sometimes VID may have both a literal and an idiomatic meaning, and the boundaries between them are difficult to trace. For example, avere gli occhi bendati (lit. to have the eyes covered) has both a literal meaning and an idiomatic one, and in this latter case it should be translated in English as to be blindfolded. According to Vietri (2014b), it is possible to classify ordinary-verb VID, namely VID which present a semantically full verb, on the basis of their definitional structure, identified by means of the arguments required by the operators. In the case of VID, the operator consists of the verb and the fixed element(s), while the argument may be the subject and/or a free complement. VIDs can also be formed by constructions based on the use of support verbs, namely avere (to have), e.g., avere fegato (lit. to have liver, transl. to have guts), essere (to be), e.g., essere a cavallo (to be golden), and fare (to make), e.g., fare lo gnorri (to play the fool). The main difference between this class of VID and the one formed by ordinary verbs is that support verbs are semantically empty, and for this reason this class of VID presents a high degree of lexical and syntactic variability. This type of variability is retrievable in aspectual variants, the production of causative constructions, and the possible deletion of the support verb, which causes complex nominalizations (Vietri, 2014a).

5 Borderline cases: LVC and IAV compared

In this section we discuss the novelties concerning two categories used in the second edition of the PARSEME shared identification task of verbal MWEs (edition 1.1), namely LVC and IAV. As regards LVC, two new subcategories have been introduced in the second edition, LVC.full and LVC.cause, to account for a more fine-grained distinction between LVCs where the verb is semantically totally bleached (e.g., to have the right), and those where the verb adds a causative meaning (and a new semantic role) to the noun (e.g., to grant the right). Therefore some new tests have been added to account for these subcategories, which heavily rely on the notion of semantic arguments.

In particular, constructions annotated as LVC.cause may involve: i) verbs that are typically used to express the cause of predicative nouns in general (e.g., cause, provoke), ii) verbs that are only used to express the cause of particular predicative nouns (e.g., grant in to grant a right).

IAV consists of a verb or VMWE and an idiomatic selected preposition or postposition that is either always required or, if absent, changes the meaning of the verb or the VMWE significantly. IAVs are verb+adposition combinations in which: i) the dependents of the adposition are not lexicalized, or ii) the adposition cannot be omitted without markedly altering the meaning of the verb. During the annotation trial, the IAV category has proved to be advantageous to cover the rich inventory of VMWEs in Italian, but some issues have also emerged, particularly with respect to the other class of LVC verbs, which also accounts for combinations of verbs plus prepositions. Prototypical examples of IAV collected so far include the following:

1.a Tendere a + N (to be inclined to something), base form tendere (to stretch), e.g., Maria tende alla depressione (Maria tends to be depressed);

1.b Tendere a + V (to be inclined to something), e.g., Maria tende a dimagrire (Maria tends to lose weight);

2. Puntare su + N (to bet), base form puntare (to stick), e.g., puntare su qualcuno/qualcosa.

These examples exhibit clear semantic changes from the non-adpositional base form of the verb; moreover, the preposition cannot be omitted in questions, thus proving to be part of the verb:

- Maria tende sempre ad esagerare. (Maria always tends to exaggerate.)
- A cosa tende, scusa? (Sorry, what does she tend to?)

Less prototypical IAV examples include verb instances exhibiting semantic changes pivoted by the arguments they combine with, like andare in (both to go to and to become), or sapere di (to smell and to know about). The type of semantic interaction at stake, called co-composition in the Generative Lexicon,5 is realized when “the complements carry information which acts on the governing verb, essentially taking the verb as argument and shifting its event type” (Pustejovsky, 1995). For example, andare in denotes directed motion when combined with proper or common place nouns, like in andare in città/montagna/America (to go to the city/mountain/America), or the medium of motion when combined with vehicle names, like in vado in bici/Ferrari (I ride my bike/drive my Ferrari). However, with nouns denoting states, like andare in estasi (to become absorbed) or andare in panico (to start feeling panic), the verb acquires the aspectual meaning of to go into the state X, and cannot be classified as an LVC. With nouns referring to events, instead, like andare in soccorso (lit. to go in assistance), the original spatial semantics bleaches by interacting with the meaning of the noun: to go into the event X actually denotes the action expressed by the predicative noun and can be classified as an LVC.

5 Co-composition has been called ‘accommodation’ in more recent works (Pustejovsky, 2013).

6 Conclusions and Future Work

In this paper we described the novelties concerning the PARSEME shared task on automatic identification of verbal MWEs - edition 1.1 (2018), in which new verb categories have been included in comparison with the 2017 edition. Some of them are language-specific, such as ICV for some Romance languages; others are not, like IAV. The increased number of categories makes it possible to annotate corpus data more thoroughly, and to discover a broad range of combinatorial phenomena that present different degrees of opacity.

We also discussed two productive categories in Italian, namely IRV and VID, and analyzed LVC and IAV borderline cases together with observations on combinatorial phenomena that can be applied in order to annotate VMWEs more effectively.

Future work includes a further linguistic analysis of borderline cases in order to contribute to the description of these phenomena.

Acknowledgments

This research has been partly supported by the European Regional Development Fund under the grant KK.01.1.1.01.0009 (DATACROSS). Authorship contribution is as follows: Johanna Monti is author of Sections 1, 2, and 3; Valeria Caruso of Section 5, and Maria Pia di Buono of Sections 4 and 6.

References

Francesca Masini. 2015. Idiomatic verb-clitic constructions: lexicalization and productivity. In Mediterranean Morphology Meetings, volume 9, pages 88–104.

Johanna Monti, Maria Pia di Buono, and Federico Sangati. 2017. PARSEME-IT corpus: an annotated corpus of verbal multiword expressions in Italian. In Fourth Italian Conference on Computational Linguistics - CLiC-it 2017, pages 228–233. Accademia University Press.

James Pustejovsky. 1995. The Generative Lexicon. MIT Press.

James Pustejovsky. 2013. Type theory and lexical decomposition. In Advances in Generative Lexicon Theory, pages 9–38. Springer.

Agata Savary, Carlos Ramisch, Silvio Cordeiro, Federico Sangati, Veronika Vincze, Behrang QasemiZadeh, Marie Candito, Fabienne Cap, Voula Giouli, Ivelina Stoyanova, et al. 2017. The PARSEME shared task on automatic identification of verbal multiword expressions. In Proceedings of the 13th Workshop on Multiword Expressions (MWE 2017), pages 31–47.

Simonetta Vietri. 2014a. Idiomatic Constructions in Italian: A Lexicon-grammar Approach, volume 31. John Benjamins Publishing Company.

Simonetta Vietri. 2014b. The lexicon-grammar of Italian idioms. In Workshop on Lexical and Grammatical Resources for Language Processing, COLING 2014, pages 137–146.
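As a supplementary check on the statistics reported in Table 1, the following minimal sketch transcribes the per-category counts and verifies that they add up to the reported VMWE totals, then derives the relative share of each category. The dictionary layout and the function names (`TABLE_1`, `totals_are_consistent`, `category_shares`, `most_frequent`) are illustrative assumptions and are not part of the PARSEME-IT release; only the numbers come from Table 1.

```python
# Counts transcribed from Table 1 (PARSEME-IT corpus version 1.1).
# The data layout and helper names below are hypothetical, chosen for illustration.
TABLE_1 = {
    "IT-dev": {
        "sentences": 917,
        "tokens": 32613,
        "VMWEs": 500,
        "categories": {
            "IAV": 44, "IRV": 106, "LS.ICV": 9,
            "LVC.cause": 19, "LVC.full": 100,
            "MVC": 6, "VID": 197,
            "VPC.full": 17, "VPC.semi": 2,
        },
    },
    "IT-train": {
        "sentences": 13555,
        "tokens": 360883,
        "VMWEs": 3254,
        "categories": {
            "IAV": 414, "IRV": 942, "LS.ICV": 20,
            "LVC.cause": 147, "LVC.full": 544,
            "MVC": 23, "VID": 1098,
            "VPC.full": 66, "VPC.semi": 0,
        },
    },
}


def totals_are_consistent(split: str) -> bool:
    """Return True if the category counts sum to the reported VMWE total."""
    row = TABLE_1[split]
    return sum(row["categories"].values()) == row["VMWEs"]


def category_shares(split: str) -> dict[str, float]:
    """Return each category's share of all VMWE annotations in the split."""
    row = TABLE_1[split]
    return {name: count / row["VMWEs"] for name, count in row["categories"].items()}


def most_frequent(split: str) -> str:
    """Return the category with the highest count in the split."""
    categories = TABLE_1[split]["categories"]
    return max(categories, key=categories.get)


if __name__ == "__main__":
    for split in TABLE_1:
        print(split, "consistent:", totals_are_consistent(split))
        for name, share in sorted(category_shares(split).items(), key=lambda kv: -kv[1]):
            print(f"  {name:10s} {share:6.1%}")
```

Under this transcription the counts in both splits sum to the reported totals, and VID and IRV are the two largest categories in each split, which is in line with their description as very productive categories in Section 4.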