=Paper=
{{Paper
|id=Vol-2006/paper023
|storemode=property
|title=PARSEME-It Corpus An annotated Corpus of Verbal Multiword Expressions in Italian
|pdfUrl=https://ceur-ws.org/Vol-2006/paper023.pdf
|volume=Vol-2006
|authors=Johanna Monti,Maria Pia di Buono,Federico Sangati
|dblpUrl=https://dblp.org/rec/conf/clic-it/MontiBS17
}}
==PARSEME-It Corpus An annotated Corpus of Verbal Multiword Expressions in Italian==
Johanna Monti (1), Maria Pia di Buono (2), Federico Sangati (3)

jmonti@unior.it, mariapia.dibuono@fer.hr, federico.sangati@gmail.com

(1) Dep. of Literary, Linguistic and Comparative Studies, "L'Orientale" University of Naples, Italy

(2) TakeLab, University of Zagreb, Croatia

(3) Independent Researcher, Italy

===Abstract===
English. This paper describes a new language resource annotated with verbal multiword expressions (VMWEs) in Italian. The paper discusses the state of the art in VMWE identification and annotation in Italian, the methodology adopted, the various VMWE categories annotated, the corpus and the annotation process. Finally, the paper ends with results, conclusion and future work.

Italiano. Questo contributo descrive una nuova risorsa linguistica annotata con polirematiche verbali per la lingua italiana. Viene presentato lo stato dell'arte relativamente all'identificazione ed all'annotazione di polirematiche per la lingua italiana, la metodologia adottata, le diverse categorie di polirematiche verbali annotate nel corpus, il corpus stesso e il processo di annotazione. Infine vengono illustrati i risultati ottenuti, le conclusioni e le prospettive future.

===1 Introduction===
This paper outlines the development of a new language resource for Italian, namely the PARSEME-It VMWE corpus, annotated with Italian MWEs of a particular class: verbal multiword expressions (VMWEs). The PARSEME-It VMWE corpus has been developed by the PARSEME-IT research group (https://www.researchgate.net/project/PARSEME-IT-Syntactic-Parsing-and-Multiword-Expressions-in-Italian) in the framework of the PARSEME Shared Task on Automatic Identification of Verbal Multiword Expressions (Savary et al., 2017), a joint effort, carried out within a European research network, to elaborate universal terminologies and annotation guidelines for verbal multiword expressions in 18 languages, including Italian.

Multiword expressions are difficult for Natural Language Processing (NLP) tools, such as parsers and machine translation engines, to identify, model and treat, mainly because of their non-compositional property. Verbal multiword expressions are particularly challenging because they have different syntactic structures (prendere una decisione 'make a decision', decisioni prese precedentemente 'decisions made previously'), may be continuous or discontinuous (andare e venire 'come and go' versus andare in malora 'go to ruin' in Luigi ha fatto andare la società in malora), and may have both a literal and a figurative meaning (abboccare all'amo 'bite the hook' or 'be deceived').

In this paper, we describe the state of the art in VMWE annotation and identification for the Italian language (section 2). We then present the methodology (section 3), the Italian VMWE categories taken into account for the annotation task (section 4), the corpus and the annotation process (section 5), and the results (section 6). Finally, we discuss conclusions and future work (section 7).

===2 State of the art in VMWE identification and annotation in Italian===
Several scholars have investigated different kinds of Italian VMWEs, focusing on both syntactic and semantic aspects. Among these works, we may distinguish contrastive and comparative analyses, and synchronic and diachronic studies.

In the first group, most scholars propose a comparison with Germanic languages (Mateu and Rigau, 2010), mainly to describe verb-particle constructions, which are a very common phenomenon in this family.

Synchronic and diachronic studies, on the other hand, include analyses of: (i) verb-particle constructions (Masini, 2005; Iacobini and Masini, 2005; Quaglia and Trotzke, 2017); (ii) idiomatic constructions (Tabossi et al., 2011; Vietri, 2014c) with either ordinary or support verbs (Vietri, 2014b); (iii) support, or light, verbs, which represent a wider phenomenon and have therefore been analysed extensively (La Fauci, 1980; D'Agostino and Elia, 1998; Cicalese, 1999; Alba-Salas, 2004; Quochi, 2007; Cicalese et al., 2016). Reflexive verbs in Italian have been investigated as occurrences of non-local anaphora (Reuland, 1990) and with regard to their syntactic classification (Carstea Romascanu, 1977).

To the best of our knowledge, only a limited number of monolingual language resources with multiwords have been developed for Italian, such as a dictionary of Italian idioms (Vietri, 2014a), a series of example corpora and a database of MWEs represented around morphosyntactic patterns (Zaninello and Nissim, 2010), and a corpus annotated with Italian MWEs of a particular class: verb-noun expressions such as fare riferimento 'refer', dare luogo 'give rise' and prendere atto 'take note' (Taslimipoor et al., 2016). At the time of writing, therefore, the PARSEME-It VMWE corpus is the first sample of a corpus that includes several types of VMWEs and is specifically developed for NLP applications.

===3 Methodology===
The development of the Italian VMWE corpus is based on the PARSEME annotation guidelines provided for the shared task (available at http://parsemefr.lif.univ-mrs.fr/guidelines-hypertext/). The guidelines were developed with the aim of delivering general definitions and prescriptions for the annotation of VMWEs in 18 languages while, at the same time, allowing language-specific descriptions of these linguistic phenomena (Savary et al., 2017). The annotation guidelines include three main categories:

1. a universal category, which is common to all the languages involved in the task and holds light-verb constructions (LVCs) and idioms (ID);

2. a quasi-universal category, relevant for some languages or language families, which contains inherently reflexive verbs (IReflVs) and verb-particle constructions (VPCs);

3. an "other VMWEs" category, which is a residual category for the occurrences not belonging to any of the previous groups.

In order to ease the identification and categorisation of VMWEs, a decision tree method was devised with generic and language-specific tests. Generic tests consider general criteria that are valid for all languages, while language-specific tests consider structural, lexical, morphological and syntactic features that are specific to the individual languages. The decision tree includes three steps: (i) identification of a VMWE candidate, i.e., a combination of a verb with at least one other word, which is a potential VMWE; (ii) identification of the lexicalized elements of the expression; (iii) assignment of the VMWE to one of the VMWE categories, using general and language-specific tests.

===4 Italian VMWEs===
For the Italian VMWE annotation task, according to the PARSEME guidelines, multiword expressions are understood as (continuous or discontinuous) sequences of words with the following compulsory properties:

• Their component words include a head word and at least one other syntactically related word. Most often the relation they maintain is a syntactic (direct or indirect) dependency, but it can also be, e.g., a coordination.

• They show some degree of orthographic, morphological, syntactic or semantic idiosyncrasy with respect to what is considered general grammar rules of a language.

• At least two components of such a word sequence have to be lexicalized.

In this task we only annotate the lexicalized components and ignore open slots. Collocations, i.e., word co-occurrences whose idiosyncrasy is of statistical nature only (e.g., the graphic shows, drastically drop, etc.), are excluded from the scope of this study. The VMWEs which have been annotated for the Italian language are:

1. Light verb constructions (LVC), which typically consist of a verb and a noun or prepositional phrase, e.g., fare una domanda ('to ask a question'), fare una passeggiata ('to have a walk'). The verb has a purely syntactic operator function (performing an activity or being in a state), whereas the noun is predicative, often referring to an event (e.g., decision, visit) or a state (e.g., fear, courage);

2. Idioms (ID), which have at least two lexicalized components including a head verb and at least one of its arguments, e.g., tirare le cuoia ('kick the bucket'), piovere a catinelle ('rain cats and dogs');

3. Inherently reflexive verbs (IReflV), which are those reflexive verbal constructions which (a) never occur without the clitic, e.g., suicidarsi ('commit suicide'), or (b) whose reflexive and non-reflexive versions have clearly different senses or subcategorization frames, e.g., riferirsi ('refer');

4. Verb particle combinations (VPC), which are formed by a lexicalized head verb and a lexicalized particle dependent on the verb. The meaning of the VPC is non-compositional. Notably, the change in the meaning of the verb goes significantly beyond adding the meaning of the particle, e.g., buttare giù ('swallow'). This type of construction is very frequent in English, German, Swedish and Hungarian, but it can also be found in Italian;

5. Other Verbal MWEs (OTH), which gather the types not belonging to any of the categories above, e.g., corto-circuitare ('short-circuit').

===5 Corpus and annotation task===
====5.1 PARSEME Italian VMWE corpus====
The PARSEME-It VMWE corpus is based on a selection of texts taken from the PAISÀ corpus of Italian web texts (Lyding et al., 2014). We chose this corpus because its documents (i) come from different web sources, e.g., Wikibooks, Wikinews, Wikiversity, and several blog services from different websites, collected in 2010 by means of Creative Commons-focused web crawling and a targeted collection of documents from specific websites; (ii) are not dedicated to any specific technical domain and are free from copyright issues, so as to be compatible with an open license; and (iii) are annotated in CoNLL format, i.e., lemmatized, POS-tagged and annotated with syntactic dependencies.

For our annotation task, we selected a sub-corpus of 17,000 sentences (corresponding to 421,848 tokens) randomly taken from blogs, Wikipedia and Wikinews. The corpus was kept in its original state and therefore no errors or inconsistencies were corrected. The pre-annotation of PAISÀ was kept in order to ease the identification of verbal MWEs, but we asked annotators not to overestimate the system's performance and to review the whole text, not only the pre-annotated candidates proposed by the system. A dedicated tag in FLAT was defined for this purpose.

The objective was to have a final corpus of at least 3,500 annotated VMWEs per language. Since the density of VMWEs depends heavily on the particular language, as well as on text choice and genre, we were not able to make any reliable estimate, at the beginning of the task, of the corpus size needed to reach this goal.

====5.2 Annotation environment====
The annotation environment used for the PARSEME-It VMWE corpus is FLAT, a web-based linguistic annotation environment (http://flat.readthedocs.io/en/latest/) based around the FoLiA format (http://proycon.github.io/folia), a rich XML-based format for linguistic annotation. FLAT allows users to view annotated FoLiA documents and enrich these documents with new annotations (Figure 1); a wide variety of linguistic annotation types is supported through the FoLiA paradigm. It is a document-centric tool that fully preserves and visualises document structure. It is open source software developed at the Centre of Language and Speech Technology, Radboud University Nijmegen, and is licensed under the GNU General Public License v3.

Figure 1: Example of annotated data in FLAT

====5.3 Annotation task====
The annotation task for the Italian language was performed in five different stages.

1. The PARSEME annotation guidelines were agreed on (http://parsemefr.lif.univ-mrs.fr/parseme-st-guidelines/1.0/?page=home) and examples for the Italian language were added in order to ease the annotation task for the Italian annotators. To this end, a two-phase pilot annotation in Italian was carried out. This step was useful in identifying the Italian VMWE categories to be annotated, but also in promoting cross-language convergence with the other languages foreseen in the shared task. Each pilot annotation phase provided feedback from annotators and was followed by enhancements of the guidelines, corpus format and processing tools.

2. A pre-processing step of the PAISÀ corpus was needed: a 'no space' column was added to the files in order to add the 'nsp' tag if a token should have been appended to the previous one without a space (Figure 2).

Figure 2: Example of the use of an nsp tag

3. The annotation of the training set (approx. 16,000 sentences) was performed manually on running text using the FLAT environment by five Italian native speakers with a linguistic background. Each annotator was given a certain number of files, each containing 1,000 sentences in CoNLL format. All doubts about the annotation were collected in a shared file and discussed during the annotation phase. Difficulties in annotating VMWEs mainly concerned (i) the boundaries of the VMWE, as in Sei ovviamente nel pieno diritto di esprimere [...] ('You are obviously fully entitled to express [...]'), where it is difficult to decide whether the VMWE should be sei ... nel ... diritto or sei ... nel pieno diritto; (ii) the category attribution, for instance for the fare + N VMWE type, since in some cases the category is LVC, as in fare rumore ('make noise'), and in others it is ID, as in fare schifo ('be disgusting'); (iii) the identification of nested VMWEs, as in Mi guardo bene, where the annotator has to decide whether the ID guardarsi bene also contains an IReflV guardarsi or not.

4. A few files were double-annotated to evaluate the inter-annotator agreement (IAA). Measuring IAA is not a trivial task because of the challenges posed by VMWEs and described in the Introduction. The available IAA results, namely the per-VMWE F-score (F_unit), the estimated Cohen's κ (κ_unit) and the standard κ (κ_cat) (Savary et al., 2017), are presented in Table 1.

5. A further 1,000 sentences were used as the test set during the shared task. VMWEs in the test set were annotated automatically by the systems that took part in the shared task, following the same guidelines.

{| class="wikitable"
|+ Table 1: IAA scores for Italian annotation: #S and #T show the number of sentences and tokens in the corpora used for measuring the IAA, respectively. #A1 and #A2 refer to the number of VMWE instances annotated by each of the annotators (Savary et al., 2017).
! Language !! #S !! #T !! #A1 !! #A2 !! F_unit !! κ_unit !! κ_cat
|-
| IT || 2000 || 52639 || 336 || 316 || 0.417 || 0.331 || 0.78
|}

===6 Results===
The PARSEME-It VMWE corpus is composed of 2,454 entries (Table 2). It is freely available (http://hdl.handle.net/11372/LRT-2282) and released under Creative Commons licenses.

{| class="wikitable"
|+ Table 2: Overview of VMWEs in the PARSEME-It VMWE corpus, including train and test sets.
! Category !! Occurrences
|-
| ID || 1163
|-
| IReflV || 730
|-
| LVC || 482
|-
| VPC || 73
|-
| OTH || 6
|-
| Total || 2454
|}

The data have been annotated using the official parseme-tsv format (http://typo.uni-konstanz.de/parseme/index.php/2-general/184-parseme-shared-task-format-of-the-final-annotation) (Figure 3), adapted from the CoNLL format.

Figure 3: Example of annotated data in parseme-tsv format

In the official parseme-tsv format, as described in Savary et al. (2017), the information about each token is represented by 4 tab-separated columns featuring (i) the position of the token in the sentence, or a range of positions (e.g., 1-2) in the case of multiword tokens such as contractions; (ii) the token surface form; (iii) an optional flag indicating that the current token is adjacent to the next one; and (iv) an optional VMWE code composed of the VMWE's consecutive number in the sentence and, for the initial token in a VMWE, its category (e.g., 2:ID if a token starts an idiom which is the second VMWE in the current sentence). In the case of nested, coordinated or overlapping VMWEs, multiple codes are separated with a semicolon.

Furthermore, in order to provide data usable as features in the shared task systems, companion files in a format close to CoNLL-U (http://universaldependencies.org/format.htm) have also been released. These companion files contain extra linguistic information, i.e., lemmas, POS-tags, morphological features, and syntactic dependencies.

===7 Conclusion and Future Work===
In this paper, we described a linguistic resource of Italian VMWEs, developed within the PARSEME Shared Task on Automatic Identification of VMWEs. We consider this work an initial contribution towards elaborating an Italian universal terminology of VMWEs. Future work includes the extension of the current corpus and a fine-grained linguistic analysis of the annotation in order to contribute to the description of these phenomena.

===Acknowledgments===
The work described in this paper has been supported by the IC1207 PARSEME COST action. The annotation work was also carried out thanks to the help of Maarten van Gompel, who adapted the FLAT annotation platform to the needs of the community. Our thanks also go to the Italian annotators, Valeria Caruso, Manuela Cherchi, Anna De Santis and Annalisa Raffone, for their contributions.

Authorship contribution is as follows: Johanna Monti is author of Sections 1, 3, 4 and 5.3; Maria Pia di Buono of Sections 2, 6 and 7; and Federico Sangati of Sections 5.1 and 5.2.

===References===
Josep Alba-Salas. 2004. Fare light verb constructions and Italian causatives: Understanding the differences. Italian Journal of Linguistics, 16(2):283.

M. Carstea Romascanu. 1977. I tipi di verbi riflessivi in italiano. Revue Roumaine de Linguistique Bucuresti, 22(2):125–130.

Anna Cicalese, Emilio D'Agostino, Alberto Maria Langella, and Ilaria Villari. 2016. Els verbs locatius com a variants de verbs de suport. Quaderns d'Italià, 21:153–166.

Anna Cicalese. 1999. Le estensioni di verbo supporto. Uno studio introduttivo. Studi italiani di linguistica teorica ed applicata, 28(3):447–485.

Emilio D'Agostino and Annibale Elia. 1998. Il significato delle frasi: un continuum dalle frasi semplici alle forme polirematiche. AA. VV, Ai limiti del linguaggio. Bari: Laterza, pages 287–310.

Claudio Iacobini and Francesca Masini. 2005. Verb-particle constructions and prefixed verbs in Italian: typology, diachrony and semantics. In Mediterranean Morphology Meetings, volume 5, pages 157–184.

Nunzio La Fauci. 1980. Aspects du mouvement de wh, verbes supports, double analyse, complétives au subjonctif en italien: pour une description compacte. Lingvisticae Investigationes, 4(2):293–341.

Verena Lyding, Egon Stemle, Claudia Borghetti, Marco Brunello, Sara Castagnoli, Felice Dell'Orletta, Henrik Dittmann, Alessandro Lenci, and Vito Pirrelli. 2014. The PAISÀ corpus of Italian web texts. In Proceedings of the 9th Web as Corpus Workshop (WaC-9), pages 36–43.

Francesca Masini. 2005. Multi-word expressions between syntax and the lexicon: the case of Italian verb-particle constructions. SKY Journal of Linguistics, 18(2005):145–173.

Jaume Mateu and Gemma Rigau. 2010. Verb-particle constructions in Romance: A lexical-syntactic account. Probus, 22(2):241–269.

Stefano Quaglia and Andreas Trotzke. 2017. Italian verb particles and clausal positions. In IATL 31: The 31st annual meeting Israel Association for Theoretical Linguistics, pages 67–82.

Valeria Quochi. 2007. A usage-based approach to light verb constructions in Italian: Development and use.

Eric Reuland. 1990. Reflexives and beyond: Non-local anaphora in Italian revisited. Grammar in progress: GLOW essays for Henk van Riemsdijk, 36:351.

Agata Savary, Carlos Ramisch, Silvio Cordeiro, Federico Sangati, Veronika Vincze, Behrang QasemiZadeh, Marie Candito, Fabienne Cap, Voula Giouli, Ivelina Stoyanova, et al. 2017. The PARSEME shared task on automatic identification of verbal multiword expressions. In Proceedings of the 13th Workshop on Multiword Expressions (MWE 2017), pages 31–47.

Patrizia Tabossi, Lisa Arduino, and Rachele Fanari. 2011. Descriptive norms for 245 Italian idiomatic expressions. Behavior Research Methods, 43(1):110–123.

Shiva Taslimipoor, Anna Desantis, Manuela Cherchi, Ruslan Mitkov, and Johanna Monti. 2016. Language resources for Italian: towards the development of a corpus of annotated Italian multiword expressions. In Proceedings of Third Italian Conference on Computational Linguistics (CLiC-it 2016) & Fifth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2016). CEUR-WS.

Simonetta Vietri. 2014a. The Italian module for NooJ. In Proceedings of the First Italian Conference on Computational Linguistics, CLiC-it, pages 389–393.

Simonetta Vietri. 2014b. Idiomatic Constructions in Italian: A Lexicon-grammar Approach, volume 31. John Benjamins Publishing Company.

Simonetta Vietri. 2014c. The lexicon-grammar of Italian idioms. In Workshop on Lexical and Grammatical Resources for Language Processing, COLING 2014, pages 137–146.

Andrea Zaninello and Malvina Nissim. 2010. Creation of lexical resources for a characterisation of multiword expressions in Italian. In LREC.
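The four-column parseme-tsv layout described in Section 6 (token position, surface form, optional adjacency flag, optional VMWE codes separated by semicolons) can be illustrated with a short reading routine. The following is only a minimal sketch, not the shared task's official tooling: the function name `parse_sentence`, the use of `_` as the placeholder for an empty optional column, and the hand-made sample annotation of Mi guardo bene (treated here as an ID containing a nested IReflV, one of the two readings discussed in Section 5.3) are all assumptions made for illustration.

```python
from collections import defaultdict

EMPTY = "_"  # assumption: placeholder used when an optional column is empty


def parse_sentence(lines):
    """Group token forms by VMWE number for one parseme-tsv sentence.

    Each line is expected to hold 4 tab-separated columns:
    position, surface form, adjacency flag, VMWE codes.
    Returns {vmwe_number: {"category": str or None, "tokens": [forms]}}.
    """
    vmwes = defaultdict(lambda: {"category": None, "tokens": []})
    for line in lines:
        position, form, adjacency, codes = line.rstrip("\n").split("\t")
        if codes == EMPTY:
            continue  # token does not belong to any VMWE
        # Nested or overlapping VMWEs: several codes separated by ';'
        for code in codes.split(";"):
            number, _, category = code.partition(":")
            entry = vmwes[int(number)]
            entry["tokens"].append(form)
            if category:  # only the first token of a VMWE carries the category
                entry["category"] = category
    return dict(vmwes)


# Hypothetical annotation, written by hand for this example only.
sample = [
    "1\tMi\t_\t1:ID;2:IReflV",
    "2\tguardo\t_\t1;2",
    "3\tbene\t_\t1",
]

result = parse_sentence(sample)
for number, vmwe in sorted(result.items()):
    print(number, vmwe["category"], " ".join(vmwe["tokens"]))
```

Keying the result on the VMWE's consecutive number mirrors the way the format itself links the tokens of a discontinuous or nested expression, so the same routine handles continuous, discontinuous and overlapping cases without special treatment.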