Advances in Multiword Expression Identification for the Italian language: The PARSEME shared task edition 1.1

Johanna Monti¹, Silvio Ricardo Cordeiro², Carlos Ramisch², Federico Sangati¹, Agata Savary³, Veronika Vincze⁴
¹ University L’Orientale, Naples, Italy
² Aix Marseille Univ, CNRS, LIS, Marseille, France
³ University of Tours, France
⁴ MTA-SZTE Research Group on Artificial Intelligence, Hungary

Abstract

English. This contribution describes the results of the second edition of the shared task on automatic identification of verbal multiword expressions, organized as part of the LAW-MWE-CxG 2018 workshop, co-located with COLING 2018, concerning both the PARSEME-IT corpus and the systems that took part in the task for the Italian language. The paper will focus on the main advances in comparison to the first edition of the task.

Italiano. Il presente contributo descrive i risultati della seconda edizione dello ‘Shared task on automatic identification of verbal multiword expressions’ organizzato nell’ambito del LAW-MWE-CxG 2018 workshop realizzato durante il COLING 2018 riguardo sia il corpus PARSEME-IT e i sistemi che hanno preso parte nel task per quel che riguarda l’italiano. L’articolo tratta i principali progressi ottenuti a confronto con la prima edizione del task.

1 Introduction

Multiword expressions (MWEs) are a particularly challenging linguistic phenomenon to be handled by NLP tools. In recent years, there has been a growing interest in MWEs since the possible improvements of their computational treatment may help overcome one of the main shortcomings of many NLP applications, from Text Analytics to Machine Translation. Recent contributions to this topic, such as Mitkov et al. (2018) and Constant et al. (2017), have highlighted the difficulties that this complex phenomenon, halfway between lexicon and syntax, characterized by idiosyncrasy on various levels, poses to NLP tasks.

This contribution will focus on the advances in the identification of verbal multiword expressions (VMWEs) for the Italian language. In Section 2 we discuss related work. In Section 3 we give an overview of the PARSEME shared task. In Section 4 we present the resources developed for the Italian language, namely the guidelines and the corpus. Section 5 is devoted to the annotation process and the inter-annotator agreement. Section 6 briefly describes the thirteen systems that took part in the shared task and the results obtained. Finally, we discuss conclusions and future work (Section 7).

2 Related work

MWEs have been the focus of the PARSEME COST Action, which enabled the organization of an international and highly multilingual research community (Savary et al., 2015). This community launched in 2017 the first edition of the PARSEME shared task on automatic identification of verbal MWEs, aimed at developing universal terminologies, guidelines and methodologies for 18 languages, including the Italian language (Savary et al., 2017). The task was co-located with the 13th Workshop on Multiword Expressions (MWE 2017), which took place during the European Chapter of the Association for Computational Linguistics (EACL 2017). The main outcomes for the Italian language were the PARSEME-IT Corpus, a 427-thousand-word annotated corpus of verbal MWEs in Italian (Monti et al., 2017) and the participation of four systems¹, namely TRANSITION, a transition-based dependency parsing system (Al Saied et al., 2017), SZEGED based on the POS and dependency modules of the Bohnet parser (Simkó et al., 2017), ADAPT (Maldonado et al., 2017) and RACAI (Boroş et al., 2017), both based on sequence labeling with CRFs.

¹ sharedtaskresults2017
Concerning the identification of verbal MWEs, some further recent contributions specifically focusing on the Italian language are:

• A supervised token-based identification approach to Italian Verb+Noun expressions that belong to the category of complex predicates (Taslimipoor et al., 2017). The approach investigates the inclusion of concordance as part of the feature set used in supervised classification of MWEs, in order to distinguish literal from idiomatic usages of expressions. It considers all concordances of the verbs fare (‘to do / to make’), dare (‘to give’), prendere (‘to take’) and trovare (‘to find’) followed by any noun, taken from the itWaC corpus (Baroni and Kilgarriff, 2006) using SketchEngine (Kilgarriff et al., 2004).

• A neural network trained to classify and rank idiomatic expressions under constraints of data scarcity (Bizzoni et al., 2017).

As for corpora annotated with VMWEs for the Italian language, no further resources have become available since the state of the art described in Monti et al. (2017). At the time of writing, therefore, the PARSEME-IT VMWE corpus still represents the first sample of a corpus which includes several types of VMWEs, specifically developed to foster NLP applications. The corpus is freely available, with the latest version (1.1) representing an enhanced corpus with some substantial changes in comparison with version 1.0 (cf. Section 4).

3 The PARSEME shared task

The second edition of the PARSEME shared task on automatic identification of verbal multiword expressions (VMWEs) was organized as part of the LAW-MWE-CxG 2018 workshop co-located with COLING 2018 (Santa Fe, USA)² and aimed at identifying verbal MWEs in running texts. According to the rules set forth in the shared task, system results could be submitted in two tracks:

• CLOSED TRACK: Systems using only the provided training/development data (VMWE annotations plus morpho-syntactic data, if any) to learn VMWE identification models and/or rules.

• OPEN TRACK: Systems that may or may not use the provided training/development data, plus any additional resources deemed useful (MWE lexicons, symbolic grammars, wordnets, raw corpora, word embeddings, language models trained on external data, etc.). This track notably includes purely symbolic and rule-based systems.

For each language, the PARSEME members elaborated (i) annotation guidelines based on annotation experiments and (ii) corpora in which VMWEs are annotated according to the guidelines. The corpora were split into training, development and test corpora for each language. Manually annotated training and development corpora were made available to the participants in advance, in order to allow them to train their systems and to tune/optimize the systems’ parameters. Raw (unannotated) test corpora were used as input to the systems during the evaluation phase. The contribution of the PARSEME-IT research group³ to the shared task is described in the next section.

² http://multiword.sourceforge.net/lawmwecxg2018
³ parseme-it/home

4 Italian resources for the shared task

The PARSEME-IT research group contributed to edition 1.1 of the shared task with the development of specific guidelines for the Italian language and with the annotation of the Italian corpus with over 3,700 VMWEs.

4.1 The shared task guidelines

The 2018 edition of the shared task relied on enhanced and revised guidelines (Ramisch et al., 2018). The guidelines⁴ are provided with Italian examples for each category of VMWE.

The guidelines include two universal categories, i.e. valid for all languages participating in the task:

• Light-verb constructions (LVCs) with two subcategories: LVCs in which the verb is semantically totally bleached (LVC.full), as in fare un discorso (‘to give a speech’), and LVCs in which the verb adds a causative meaning to the noun (LVC.cause), as in dare il mal di testa (‘to give a headache’);

• Verbal idioms (VIDs) like gettare le perle ai porci (‘to throw pearls before swine’).

Three quasi-universal categories, valid for some language groups or languages but non-existent or very exceptional in others, are:

• Inherently reflexive verbs (IRV), which are those reflexive verbal constructions which (a) never occur without the clitic, e.g. suicidarsi (‘to commit suicide’), or (b) have clearly different senses or subcategorization frames in their reflexive and non-reflexive versions, e.g. riferirsi (‘to refer’) as opposed to riferire (‘to report / to tell’);

• Verb-particle constructions (VPC) with two subcategories: fully non-compositional VPCs (VPC.full), in which the particle totally changes the meaning of the verb, as in buttare giù (‘to swallow’), and semi non-compositional VPCs (VPC.semi), in which the particle adds a partly predictable but non-spatial meaning to the verb, as in andare avanti (‘to proceed’);

• Multi-verb constructions (MVC) composed of a sequence of two adjacent verbs, as in lasciar perdere (‘to give up’).

An optional experimental category (if admitted by the given language, as is the case for Italian) is considered in a post-annotation step:

• Inherently adpositional verbs (IAVs), which consist of a verb or VMWE and an idiomatic selected preposition or postposition that is either always required or, if absent, changes the meaning of the verb significantly, as in confidare su (‘to trust on’).

Finally, a language-specific category was introduced for the Italian language:

• Inherently clitic verbs (LS.ICV) formed by a full verb combined with one or more non-reflexive clitics that represent the pronominalization of one or more complements (CLI). LS.ICV is annotated when (a) the verb never occurs without one non-reflexive clitic, as in entrarci (‘to be relevant to something’), or (b) the LS.ICV and the non-clitic versions have clearly different senses or subcategorization frames, as in prenderle (‘to be beaten’) vs prendere (‘to take’).

⁴ parseme-st-guidelines/1.1/

4.2 The PARSEME-IT corpus

The PARSEME-IT VMWE corpus version 1.1 is an updated version of the corpus used for edition 1.0 of the shared task. It is based on a selection of texts from the PAISÀ corpus of web texts (Lyding et al., 2014), including Wikibooks, Wikinews, Wikiversity, and blog services. The PARSEME-IT VMWE corpus was updated in edition 1.1 according to the new guidelines described in the previous section. Table 1 summarizes the size of the corpus developed for the Italian language and presents the distribution of the annotated VMWEs per category.

           sent.  tokens  VMWEs  IAV   IRV  LS.ICV  LVC.cause/full  MVC   VID  VPC.full/semi
IT-dev       917   32613    500   44   106       9          19/100    6   197           17/2
IT-train   13555  360883   3254  414   942      20         147/544   23  1098           66/0
IT-test     1256   37293    503   41    96       8          25/104    5   201           23/0
IT-Total   15728  430789   4257  499  1144      37         191/748   34  1496          106/2

Table 1: Statistics of the PARSEME-IT corpus version 1.1.

The training, development and test data are available in the LINDAT/Clarin repository, and all VMWE annotations are available under Creative Commons licenses (see files for details). The format of the released corpus is based on an extension of the widely used CoNLL-U file format.

5 Annotation process

The annotation was performed manually in running texts using the FoLiA linguistic annotation tool (van Gompel and Reynaert, 2013) by six Italian native speakers with a background in linguistics, using a specific decision tree for the Italian language for joint VMWE identification and classification.⁸

In order to allow the annotation of IAVs, a new pre-processing step was introduced to split compound prepositions such as della (‘of the’) into two tokens. This step was necessary to annotate only the lexicalised components of the IAV, as in portare alla disperazione, where only the verb and the preposition a should be annotated, without the article la.

Once the annotation was completed, in order to reduce noise and to increase the consistency of the annotations, we applied the consistency checking tool developed for edition 1.0 (Savary et al., forthcoming). The tool groups all annotations of the same VMWE, making it possible to spot annotation inconsistencies very easily.

⁸ parseme-st-guidelines/1.1/?page=it-dectree

5.1 Inter-annotator agreement

A small portion of the corpus consisting of 1,000 sentences was double-annotated. In comparison with the previous edition, the inter-annotator agreement shown in Table 2 increased, although it is still not optimal.⁹ The improvement is probably due to the fact that, this time, the group was based in one place with the exception of one annotator, and several meetings took place prior to the annotation phase in order to discuss the new guidelines.

                   #S   #A1  #A2  Fspan  κspan  κcat
PARSEME-IT-2017  2000   336  316  0.417  0.331  0.78
PARSEME-IT-2018  1000   341  379  0.586  0.550  0.882

Table 2: IAA scores for the PARSEME-IT corpus in versions from 2017 and 2018: #S is the number of sentences in the double-annotated corpus used for measuring the IAA. #A1 and #A2 refer to the number of VMWE instances annotated by each of the annotators. Fspan is the F-measure for identifying the span of a VMWE, when considering that one of the annotators tries to predict the other’s annotations (VMWE categories are ignored). κspan and κcat are the values of Cohen’s κ for span identification and categorization, respectively.

The two annotators involved in the IAA task annotated 191 VMWEs with no disagreement, but there were several problems, which led to 44 cases of partial disagreement and 250 cases of total disagreement:

• PARTIAL MATCHES LABELED (25 cases), in which there is at least one token of the VMWE in common between the two annotators and the labels assigned are the same. The disagreement mainly concerns which lexicalized elements are part of the VMWE, as in the case of the VID porre in cattiva luce (‘make look bad’), where the annotators disagreed about considering the adjective cattiva (‘bad’) as part of the VID.

• EXACT MATCHES UNLABELED (18 cases), in which the annotators agreed on the lexicalized components of the VMWE to be annotated but not on the label. This type of disagreement is mainly related to fine-grained categories such as LVC.cause and LVC.full, as in the case of dare ... segnale (‘to give ... a signal’), or VPC.full and VPC.semi, as for mettere insieme (‘to put together’).

• PARTIAL MATCHES UNLABELED (1 case), in which there is at least one token of the VMWE in common between the two annotators but the labels assigned are different, such as buttar-si in la calca (‘to join the crowd’) classified as VID by the first annotator and buttar-si (‘to throw oneself’) classified as IRV by the second one in the following sentence: [...] attendendo il venerdì sera per buttarsi nella calca del divertimento [...] (‘waiting for the Friday evening to join the crowd for entertainment’).

• ANNOTATIONS CARRIED OUT ONLY BY ONE OF THE ANNOTATORS: This is the category which collects the most numerous examples of disagreement between annotators: 106 VMWEs were annotated only by annotator 1 and 144 only by annotator 2.

⁹ As mentioned in Ramisch et al. (2018), the estimation of chance agreement in κspan and κcat is slightly different between 2017 and 2018, therefore these results are not directly comparable.

6 The systems and the results of the shared task for the Italian language

Whereas only four systems took part in edition 1.0 of the shared task for the Italian language, in edition 1.1, fourteen systems took on this challenge. The systems that took part in the PARSEME shared task are listed in Table 3: 12 took part in the closed track and two in the open one. The two systems that took part in the open track reported the resources that were used: SHOMA used pre-trained Wikipedia word embeddings (Taslimipoor and Rohanian, 2018), while Deep-BGT (Berk et al., 2018) relied on the BIO tagging scheme and its variants (Schneider et al., 2014) to introduce additional tags to encode gappy (discontinuous) VMWEs.

A distinctive characteristic of the systems of edition 1.1 is that most of them (GBD-NER-resplit and GBD-NER-standard, TRAPACC and TRAPACC-S, SHOMA, Deep-BGT) use neural networks, while the rest of the systems adopt other approaches: CRF-DepTree-categs and CRF-Seq-nocategs are based on a tree-structured CRF, MWETreeC and TRAVERSAL on syntactic trees and parsing methods, Polirem-basic and Polirem-rich on statistical methods and association measures, and finally varIDE uses a Naive Bayes classifier.
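The idea of encoding gappy (discontinuous) VMWEs with additional BIO tags can be illustrated with a minimal sketch. It is a simplified illustration in the spirit of the gappy variants of Schneider et al. (2014), not the implementation used by Deep-BGT: the function name encode_gappy_bio, the lowercase o tag for tokens inside a gap, the restriction to one VMWE per sentence and the example sentence built around dare ... segnale are all assumptions made here.

```python
from typing import List, Set


def encode_gappy_bio(tokens: List[str], mwe_indices: Set[int]) -> List[str]:
    """Tag a sentence that contains at most one (possibly discontinuous) VMWE.

    Assumed tag set for this sketch:
      B = first lexicalized token of the VMWE
      I = any later lexicalized token of the VMWE
      o = token inside the gap (between VMWE tokens, not part of the VMWE)
      O = token outside the VMWE
    """
    if not mwe_indices:
        return ["O"] * len(tokens)
    first, last = min(mwe_indices), max(mwe_indices)
    tags = []
    for i in range(len(tokens)):
        if i == first:
            tags.append("B")
        elif i in mwe_indices:
            tags.append("I")
        elif first < i < last:
            tags.append("o")  # gap token: lowercase marks the discontinuity
        else:
            tags.append("O")
    return tags


# Hypothetical sentence fragment: the VMWE is dare ... segnale (tokens 0 and 3).
tokens = ["dare", "un", "forte", "segnale"]
print(list(zip(tokens, encode_gappy_bio(tokens, {0, 3}))))
# [('dare', 'B'), ('un', 'o'), ('forte', 'o'), ('segnale', 'I')]
```

With a plain B/I/O scheme the two gap tokens would be tagged O, and the link between dare and segnale could no longer be recovered from the tag sequence alone.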
The systems were ranked according to two types of evaluation measures (Ramisch et al., 2018): a strict per-VMWE score (in which each VMWE in gold is either deemed predicted or not, in a binary fashion) and a fuzzy per-token score (which takes partial matches into account). For each of these two, precision (P), recall (R) and F1-scores (F) were calculated. Table 3 shows the ranking of the systems which participated in the shared task for the Italian language.

Table 3: Results for the Italian language

The systems with the highest MWE-based rank for Italian have F1 scores that are mostly comparable to the scores obtained in the General ranking of all languages (e.g. TRAVERSAL had a General F1 of 54.0 vs an Italian F1 of 49.2, being ranked first in both cases). Nevertheless, the Italian scores are consistently lower than the ones in the General ranking, even if only by a moderate margin, suggesting that Italian VMWEs in this specific corpus might be harder to identify. One of the outliers in the table is MWETreeC, which predicts far fewer VMWEs than are present in the annotated corpora. This turned out to be true for other languages as well. The few VMWEs that were predicted only obtained partial matches, which explains why its MWE-based score was 0. Another clear outlier is Polirem-basic. Both Polirem-basic and Polirem-rich had predictions for Italian, French and Portuguese. Their scores are somewhat comparable in the three languages, suggesting that the lower scores are a characteristic of the system and not some artifact of the Italian corpus.

TRAVERSAL (Waszczuk, 2018) was the best performing system in the closed track, while SHOMA (Taslimipoor and Rohanian, 2018) performed best in the open one. As shown in Figure 1, which compares the MWE-based F1 scores for each label for the two best performing systems, TRAVERSAL obtained overall better results for almost all VMWE categories with the exception of VID and MVC, for which SHOMA showed a better performance.

Figure 1: Chart comparing the MWE-based F1 scores for each label of the two best performing systems.

7 Conclusions and future work

This paper presented the results of the PARSEME shared task edition 1.1 and described the advances achieved in this edition in comparison with the previous one, while also highlighting that there is room for further improvement. We are working on some critical areas which emerged during the annotation task, in particular with reference to some borderline cases and the refinement of the guidelines. Future work will focus on maintaining and increasing the quality and the size of the corpus, but also on extending the shared task to other MWE categories, such as nominal MWEs.
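The difference between the strict per-VMWE score and the fuzzy per-token score used in Section 6 can be illustrated with a minimal sketch. It is a simplified illustration and not the official evaluation script of the shared task: the names prf, mwe_based and token_based are hypothetical, VMWE categories are ignored, a single sentence is assumed, and the token-based score is reduced to a plain overlap of token indices.

```python
from typing import FrozenSet, Set, Tuple

Mwe = FrozenSet[int]  # token indices of one VMWE within a sentence


def prf(true_pos: int, n_pred: int, n_gold: int) -> Tuple[float, float, float]:
    """Precision, recall and F1 from raw counts (0.0 when undefined)."""
    p = true_pos / n_pred if n_pred else 0.0
    r = true_pos / n_gold if n_gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


def mwe_based(gold: Set[Mwe], pred: Set[Mwe]) -> Tuple[float, float, float]:
    """Strict score: a predicted VMWE counts only if all its tokens match."""
    return prf(len(gold & pred), len(pred), len(gold))


def token_based(gold: Set[Mwe], pred: Set[Mwe]) -> Tuple[float, float, float]:
    """Fuzzy score (simplified): overlap of the tokens covered by VMWEs."""
    gold_tokens = set().union(*gold)
    pred_tokens = set().union(*pred)
    return prf(len(gold_tokens & pred_tokens), len(pred_tokens), len(gold_tokens))


# Hypothetical case based on porre in cattiva luce (token indices 0-3):
# gold excludes the adjective cattiva (index 2), the prediction includes it.
gold = {frozenset({0, 1, 3})}
pred = {frozenset({0, 1, 2, 3})}
print("MWE-based  P/R/F:", mwe_based(gold, pred))
print("token-based P/R/F:", token_based(gold, pred))
```

In this example the strict score is zero while the token-based score is not, which is the situation described above for MWETreeC, whose predictions only obtained partial matches.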