Language resources for Italian: towards the development of a corpus of annotated Italian multiword expressions

Shiva Taslimipoor (University of Wolverhampton, UK) shiva.taslimi@wlv.ac.uk
Anna Desantis, Manuela Cherchi (University of Sassari, Italy) annadesantis_91@libero.it, manuealacherchi82@gmail.com
Ruslan Mitkov (University of Wolverhampton, UK) r.mitkov@wlv.ac.uk
Johanna Monti ("L'Orientale" University of Naples, Italy) jmonti@unior.it

Abstract

English. This paper describes the first resource annotated for multiword expressions (MWEs) in Italian. Two versions of this dataset have been prepared: the first with a fast mark-up of a list of out-of-context MWEs, and the second with an in-context annotation, where the MWEs are annotated together with their contexts. The paper also discusses annotation issues and reports the inter-annotator agreement for both types of annotation. Finally, the results of the first exploitation of the new resource, namely the automatic extraction of Italian MWEs, are presented.

Italiano. This paper describes the first Italian resource annotated with multiword expressions (polirematiche). Two versions of the dataset have been prepared: the first with a list of multiword expressions without context, and the second with in-context annotation. The paper discusses the issues that emerged during annotation and reports the degree of agreement between annotators for both types of annotation. Finally, the results of the first use of the new resource, namely the automatic extraction of multiword expressions for Italian, are presented.

1 Rationale

Multiword expressions (MWEs) are a pervasive phenomenon in language, and their computational treatment is crucial for users and NLP applications alike (Baldwin and Kim, 2010; Granger and Meunier, 2008; Monti et al., 2013; Monti and Todirascu, 2015; Seretan and Wehrli, 2013). However, despite being desiderata for linguistic analysis and language learning, as well as for the training and evaluation of NLP tasks such as term extraction (and Machine Translation in multilingual scenarios), resources annotated with MWEs are a scarce commodity (Schneider et al., 2014b). The need for such resources is even greater for Italian, which does not benefit from the variety and volume of resources that English does.

This paper outlines the development of a new language resource for Italian, namely a corpus annotated with Italian MWEs of a particular class: verb-noun expressions such as fare riferimento, dare luogo and prendere atto. Such collocations are reported to be the most frequent class of MWEs, and they are of high practical importance both for automatic translation and for language learning. To the best of our knowledge, this is the first resource of its kind for Italian.

The development of this corpus is part of a multilingual project addressing the challenge of the computational treatment of MWEs. The project covers English, Spanish, Italian and French, and its goal is to develop a knowledge-poor methodology for automatically identifying MWEs and retrieving their translations (Taslimipoor et al., 2016) for any pair of languages. The developed methodology will be used for Machine Translation and multilingual dictionary compilation, and also in computer-aided tools to support the work of language learners and translators.

Two versions of the above resource have been produced. The first consists of lists of MWEs annotated out of context, with a view to performing fast evaluation of the developed methodology (out-of-context mark-up).
The second consists of MWEs annotated along with their concordances (in-context annotation). The latter type of annotation is time-consuming, but provides the contexts in which the MWEs are annotated.

2 Annotation of MWEs: out-of-context mark-up and in-context annotation

After more than two decades of computational studies on MWEs, the lack of a proper gold standard is still an issue. Lexical resources such as dictionaries have limited coverage of these expressions (Losnegaard et al., 2016), and there is also no proper corpus tagged for MWEs in any language (Schneider et al., 2014b).

Most previous studies on the computational treatment of MWEs have focused on extracting types (rather than tokens)[1] of MWEs from corpora (Ramisch et al., 2010; Villavicencio et al., 2007; Rondon et al., 2015; Salehi and Cook, 2013). The widely used toolkits mwetoolkit (Ramisch et al., 2010) and Xtract (Smadja, 1993) extract expressions whose statistical distribution indicates that they are likely to be MWEs. The evaluation of type-based extraction of MWEs has mostly been performed against a dictionary (de Caseli et al., 2010), a lexicon (Pichotta and DeNero, 2013) or a list of human-annotated expressions (Villavicencio et al., 2007). However, there are expressions like have a baby which, in exactly the same form and structure, may be an MWE (meaning to give birth) in some contexts and a literal expression in others.

As for the automatic identification of tokens of MWEs, Fazly et al. (2009) make use of both linguistic properties and the local context in determining the class of an MWE token. They report an unsupervised approach to identifying idiomatic and literal usages of an expression in context. Their method is evaluated on a very small sample of human-annotated expressions in a small portion of the British National Corpus (BNC). Schneider et al. (2014a) developed a supervised model whose purpose is to identify MWEs in context. Their methodology results in a corpus of automatically annotated MWEs. It is not clear, however, whether their methodology is able to tag a specific expression as an MWE in one context and as a non-MWE in another. The PARSEME shared task[2] is also devoted to annotating verbal MWEs in several languages. The shared task, while generating interesting discussions in the area, has embarked upon the labour-intensive annotation of verbal MWEs.

Since there is no list of verb-noun MWEs for Italian, we first automatically compile a list of such expressions to be annotated by human experts. This follows previous attempts at extracting a lexicon of MWEs (as in Villavicencio (2005)). Annotators are not provided with any context, and hence the task is more feasible in terms of time. Human annotators are asked to label the expressions as MWEs only if they have a sufficient degree of idiomaticity; in other words, a Verb+Noun MWE does not convey a literal meaning, in that its verb is delexicalised.

However, we believe that idiomaticity is not a binary property; rather, it is known to fall on a continuum from completely semantically transparent, or literal, to entirely opaque, or idiomatic (Fazly et al., 2009). This makes the task of out-of-context mark-up more challenging for annotators, since they have to pick a value covering all the possible contexts of a target expression. This ambiguity, and the fact that many expressions are MWEs in some contexts but not in others, prompted us to initiate a subsequent annotation in which MWEs are tagged in their contexts. The idea is to extract the concordances around all the occurrences of a Verb+Noun expression and provide annotators with these concordances, so that they can decide on the degree of idiomaticity of the specific verb-noun expression. We compare the reliability of the in-context and out-of-context annotations by way of the agreement between annotators.

[1] Type refers to the canonical form of an expression, while token refers to each instance (usage) of the expression, in any morphological form, in text.
[2] http://typo.uni-konstanz.de/parseme/index.php/2-general/142-parseme-shared-task-on-automatic-detection-of-verbal-mwes

2.1 Experimental expressions

Highly polysemous verbs, such as give and take in English and fare and dare in Italian, participate widely in Verb+Noun MWEs, in which they contribute a broad range of figurative meanings that must be recognised (Fazly et al., 2007). We focus on four highly frequent Italian verbs: fare, dare, prendere and trovare. We extract all the occurrences of these verbs when followed by any noun from the itWaC corpus (Baroni and Kilgarriff, 2006), using SketchEngine (Kilgarriff et al., 2004). For the first experiment, all the Verb+Noun types are extracted with the verb lemmatised; for the second experiment, all the concordances of these verbs when followed by a noun are generated.
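The type-extraction step can be illustrated with a short Python sketch. It assumes the corpus is available as a stream of (surface form, lemma, part-of-speech) triples, as in the vertical format common for itWaC-style corpora; the tags VER and NOUN, the strict adjacency of verb and noun, and the function name are illustrative assumptions, not the exact SketchEngine queries used in our experiments.

    from collections import Counter

    TARGET_VERBS = {"fare", "dare", "prendere", "trovare"}

    def extract_verb_noun_types(token_stream, min_freq=20):
        # token_stream yields (surface, lemma, pos) triples;
        # 'VER' and 'NOUN' are assumptions about the tagset
        counts = Counter()
        prev = None
        for surface, lemma, pos in token_stream:
            if (prev is not None and prev[2].startswith("VER")
                    and prev[1] in TARGET_VERBS and pos.startswith("NOUN")):
                counts[(prev[1], lemma)] += 1  # verb lemma + noun lemma
            prev = (surface, lemma, pos)
        # keep only candidates above the frequency threshold (20 in Section 2.2)
        return {cand: f for cand, f in counts.items() if f >= min_freq}

    # toy usage: two occurrences of fare riferimento
    toy = [("fa", "fare", "VER"), ("riferimento", "riferimento", "NOUN"),
           ("fanno", "fare", "VER"), ("riferimento", "riferimento", "NOUN")]
    print(extract_verb_noun_types(toy, min_freq=2))  # {('fare', 'riferimento'): 2}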
2.2 Out-of-context mark-up of Verb+Noun(s)

The extraction of Verb+Noun candidates for the four verbs in focus, and the removal of the expressions with frequencies lower than 20, results in a dataset of 3,375 expressions. Two native speakers annotated every candidate expression: 1 for an MWE, if the expression was idiomatic, and 0 for a non-MWE, if the expression was literal. We also defined the tag 2 for expressions that behave as MWEs in some contexts but not in others, e.g. dare frutti, which has a literal usage meaning to produce fruits but in some contexts means to produce results, and is an MWE in those contexts. While this out-of-context 'fast track' annotation procedure saves time and yields a long list of marked-up expressions, annotators often feel uncomfortable due to the lack of context. The inter-annotator agreement in terms of Kappa is shown in Table 2, where it is compared with that of the in-context annotation of MWEs explained in Section 2.3.

Table 1: Annotation details (A: annotator).

Annotation task    A      tag 0    tag 1    tag 2
Out-of-context     1st    2,491      792       92
                   2nd    2,112    1,127      136
In-context         1st   10,478   19,616        -
                   2nd    9,058   21,036        -

Table 2: Inter-annotator agreement.

Annotation task    Kappa    Observed agreement
Out-of-context     0.40     0.73
In-context         0.65     0.85

2.3 Annotating Verb+Noun(s) in context

We design an annotation task in which we provide a sample of all the usages of each type of Verb+Noun expression to be annotated. For this purpose, we employ SketchEngine to list all the concordances of each verb when it is followed by a noun. Concordances include the verb in focus with approximately ten words before and ten words after it. SketchEngine returns at most 100,000 concordances per query. Among them, we filter out the concordances containing Verb+Noun expressions with frequencies lower than 50, and we randomly select 10% of the concordances for each verb. As a result, there are 30,094 concordances to be annotated. The two annotators annotate all the usages of Verb+Noun expressions in these concordances, considering the context in which each expression occurs, marking up MWEs with 1 and expressions which are not MWEs with 0. Table 1 reports the details of the annotation tasks and Table 2 shows the agreement figures for them.
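Agreement figures of the kind reported in Table 2 can be recomputed from the two annotators' label sequences using observed agreement and Cohen's kappa. The following is a minimal sketch of the standard formulas, not the exact script used for this paper; the function names and toy labels are illustrative.

    from collections import Counter

    def observed_agreement(a, b):
        # proportion of items that receive the same label from both annotators
        return sum(x == y for x, y in zip(a, b)) / len(a)

    def cohens_kappa(a, b):
        # kappa corrects observed agreement for agreement expected by chance
        n = len(a)
        p_o = observed_agreement(a, b)
        ca, cb = Counter(a), Counter(b)
        p_e = sum((ca[l] / n) * (cb[l] / n) for l in set(a) | set(b))
        return (p_o - p_e) / (1 - p_e)

    # toy usage with binary labels (1 = MWE, 0 = literal)
    ann1 = [1, 1, 0, 0, 1, 0, 1, 0]
    ann2 = [1, 0, 0, 0, 1, 0, 1, 1]
    print(observed_agreement(ann1, ann2), cohens_kappa(ann1, ann2))  # 0.75 0.5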
2.4 Discussion

As seen in Table 2, the inter-annotator agreement is significantly higher when the expressions are annotated in context. One of the main causes of disagreement in out-of-context annotation concerns abstract nouns. The annotation of expressions composed of a verb followed by a noun with an abstract meaning is a more complicated process, as the candidate expression may carry a figurative meaning. Each annotator uses their intuition to annotate them, which leads to inconsistent tags for these expressions (e.g. fare notizia, dare identità, prendere possesso) when they are out of context. In the case of in-context annotation, by contrast, concordances containing abstract nouns have in the majority of cases been annotated with 1 by both annotators.

In-context annotation is also very helpful for annotating expressions with both idiomatic and literal meanings. An interesting observation, reported in Table 3, concerns the number of expressions that are found with both usages, idiomatic and non-idiomatic, in context.

Table 3: Statistics on the in-context annotation.[3]

                 0-tagged    1-tagged    context-depending
1st annotator        924         195                  530
2nd annotator        696         424                  529

As can be seen in Table 3, among the 1,649 types of expressions in the concordances, 530 (32%) could be MWEs in some contexts and non-MWEs in others (context-depending), according to the first annotator. The same annotator annotated only 3% of the expressions with tag 2 when no context was given.

[3] Note that the numbers in Table 3 cannot be interpreted to validate agreement between annotators, i.e. no conclusion about agreement can be drawn from Table 3.

3 First use of the MWE resource: comparative evaluation of the automatic extraction of Italian MWEs

In our multilingual project (see Section 1) we regard the automatic translation of MWEs as a two-stage process. The first stage is the extraction of MWEs in each of the languages; the second is a matching procedure over the extracted MWEs in each language which proposes translation equivalents. In this study, the extraction of MWEs is based on statistical association measures (AMs). These measures have been proposed to determine the degree of compositionality and the fixedness of expressions. The less compositional and the more fixed expressions are, the more likely they are to be MWEs (Evert, 2008; Bannard, 2007). According to Evert (2008), there is no ideal association measure for all purposes. We aim to evaluate AMs as a baseline approach against the annotated data which we have prepared. We focus on a selection of five AMs which have been widely discussed as among the best measures for identifying MWEs: MI3 (Oakes, 1998), log-likelihood (Dunning, 1993), T-score (Krenn and Evert, 2001), logDice (Rychlý, 2008) and Salience (Kilgarriff et al., 2004), all as defined in SketchEngine. We compare the performance of these AMs, together with frequency of occurrence (Freq) as a sixth measure, in ranking the candidate MWEs. We evaluate the effect of these measures in ranking MWEs on both kinds of datasets.
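For reference, the following sketch gives textbook formulations of the six measures, computed from the co-occurrence frequency f of a verb-noun pair, the marginal frequencies f_v and f_n, and the corpus size n. The formulas follow the standard literature (Dunning, 1993; Rychlý, 2008); SketchEngine's exact definitions, in particular of Salience, may differ across versions, so the code below should be read as an approximation of the measures named in the text, not as their authoritative implementation.

    import math

    def association_scores(f, f_v, f_n, n):
        expected = f_v * f_n / n                 # expected co-occurrence count
        mi = math.log2(f / expected)             # pointwise mutual information
        mi3 = math.log2(f ** 3 / expected)       # MI3 boosts frequent pairs
        t_score = (f - expected) / math.sqrt(f)
        log_dice = 14 + math.log2(2 * f / (f_v + f_n))
        salience = mi * math.log(f + 1)          # one published variant of salience

        # Dunning's log-likelihood over the full 2x2 contingency table
        o = [f, f_v - f, f_n - f, n - f_v - f_n + f]
        e = [f_v * f_n / n, f_v * (n - f_n) / n,
             (n - f_v) * f_n / n, (n - f_v) * (n - f_n) / n]
        log_lik = 2 * sum(oi * math.log(oi / ei) for oi, ei in zip(o, e) if oi > 0)

        return {"Freq": f, "MI3": mi3, "T-score": t_score, "logDice": log_dice,
                "Salience": salience, "log-likelihood": log_lik}

    # toy usage: a pair seen 120 times; verb 10,000 times, noun 800 times
    print(association_scores(f=120, f_v=10_000, f_n=800, n=2_000_000))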
3.1 Experiments on type-based extraction of MWEs

In the first experiment, the list of all extracted Verb+Noun combinations (as explained in Section 2.1) is ranked according to the above measures, computed from itWaC as a reference corpus. To perform the evaluation against the list of annotated expressions, we use all 2,415 expressions for which the annotators agreed on tag 0 or tag 1. After ranking the expressions by each measure, we examine its retrieval performance by computing the 11-point Interpolated Average Precision (11-p IAP). This reflects how good a measure is at ranking the relevant items (here, MWEs) before the irrelevant ones. To this end, the interpolated precision at the 11 recall values 0, 10%, ..., 100% is calculated. As detailed in Manning et al. (2008), the interpolated precision at a certain recall level r is defined as the highest precision found for any recall level r' ≥ r. The average of these 11 points is reported as 11-p IAP in Table 4.

Table 4: 11-p IAP for ranking MWEs using different AMs.

AM                11-p IAP
Freq              0.49
MI3               0.51
log-likelihood    0.49
Salience          0.49
logDice           0.48
T-score           0.49

As can be seen in Table 4, the selected association measures perform very similarly in ranking this type of MWEs, with MI3 performing slightly better than the others.
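The 11-p IAP computation itself is straightforward to reproduce. The sketch below implements the Manning et al. (2008) definition directly, taking the gold labels of the candidates in ranked order; the function name and the toy input are illustrative, not our evaluation script.

    def eleven_point_iap(ranked_labels):
        # ranked_labels: gold labels (1 = MWE, 0 = not), best-scored first
        total_pos = sum(ranked_labels)
        recalls, precisions = [], []
        tp = 0
        for i, label in enumerate(ranked_labels, start=1):
            tp += label
            recalls.append(tp / total_pos)
            precisions.append(tp / i)
        points = []
        for level in [i / 10 for i in range(11)]:   # recall levels 0.0 ... 1.0
            # interpolated precision: max precision at any recall r' >= level
            attainable = [p for p, r in zip(precisions, recalls) if r >= level]
            points.append(max(attainable) if attainable else 0.0)
        return sum(points) / 11

    # toy usage: six ranked candidates, three of them gold MWEs
    print(eleven_point_iap([1, 1, 0, 1, 0, 0]))  # 0.9090...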
3.2 Experiments on token-based identification of MWEs

In the second experiment, we seek to establish the effectiveness of these measures in identifying the usages of MWEs in our dataset of in-context annotations. We set a threshold for each score computed over the Verb+Noun expression types, and use it to compute the classification accuracy of the measures in identifying MWEs among the usages of Verb+Noun expressions in a corpus. Specifically, each Verb+Noun candidate in the concordances is automatically tagged as an MWE if its lemmatised form has a score higher than the threshold, and as a non-MWE otherwise. For each measure, we compute the arithmetic mean of its values over all expressions, and set the resulting average as the threshold.

The accuracies of classifying the candidate Verb+Noun expressions are computed against the human annotations of the concordances and are shown in Table 5. The classification accuracies of the AMs are also very close to each other; this time, however, log-likelihood and Freq fare slightly better than the others in classifying tokens of Verb+Noun expressions.

Table 5: Accuracy of the AMs in classifying usages of Verb+Noun(s).

AM                Accuracy
Freq              0.72
MI3               0.68
log-likelihood    0.72
Salience          0.69
logDice           0.67
T-score           0.69
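A minimal sketch of this mean-threshold token classification follows, under the simplifying assumption, stated above, that every token of a type receives the type-level score of its lemmatised form; the data structures and values are illustrative.

    from statistics import mean

    def token_accuracy(type_scores, token_gold):
        # type_scores: {(verb, noun): association score}
        # token_gold:  [((verb, noun), gold label)] for annotated concordances
        threshold = mean(type_scores.values())   # arithmetic mean as threshold
        correct = 0
        for expr_type, gold in token_gold:
            predicted = 1 if type_scores[expr_type] > threshold else 0
            correct += predicted == gold
        return correct / len(token_gold)

    # toy usage: a strongly scored type and a weakly scored one
    scores = {("fare", "riferimento"): 8.2, ("fare", "torta"): 1.3}
    tokens = [(("fare", "riferimento"), 1), (("fare", "riferimento"), 1),
              (("fare", "torta"), 0), (("fare", "torta"), 1)]
    print(token_accuracy(scores, tokens))  # 0.75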
In future work, we are Carla Parra Escartín, Agata Savary, Sascha interested in extracting context features from con- Bargmann, and Johanna Monti. 2016. Parseme cordances in this resource to automatically recog- survey on mwe resources. In Proceedings of the Tenth International Conference on Language nise and classify the expressions that are MWEs in Resources and Evaluation (LREC 2016), Paris, some contexts but not MWEs in others. France. European Language Resources Association (ELRA). Christopher D Manning, Prabhakar Raghavan, and Violeta Seretan and Eric Wehrli. 2013. Syntactic Hinrich Schütze. 2008. Introduction to information concordancing and multi-word expression detection. retrieval. Cambridge University Press. International Journal of Data Mining, Modelling and Management, 5(2):158–181. Johanna Monti and Amalia Todirascu. 2015. Multi- word units translation evaluation in machine trans- Frank Smadja. 1993. Retrieving collocations from lation: another pain in the neck? In Proceedings text: Xtract. Computational Linguistics, 19:143– of MUMTTT workshop, Corpas Pastor G, Monti J, 177. Mitkov R, Seretan V (eds) (2015), Multi-word Units in Machine Translation and Translation Technology. Shiva Taslimipoor, Ruslan Mitkov, Gloria Corpas Pas- tor, and Afsaneh Fazly. 2016. Bilingual contexts from comparable corpora to mine for translations of Johanna Monti, Ruslan Mitkov, Gloria Corpas Pastor, collocations. In Proceedings of the 17th Interna- and Violeta Seretan. 2013. Multi-word units in ma- tional Conference on Intelligent Text Processing and chine translation and translation technologies. Computational Linguistics, CICLing’16. Springer. Michael P. Oakes. 1998. Statistics for Corpus Linguis- Aline Villavicencio, Valia Kordoni, Yi Zhang, Marco tics. Edinburgh: Edinburgh University Press. Idiart, and Carlos Ramisch. 2007. Validation and evaluation of automatically acquired multiword ex- Karl Pichotta and John DeNero. 2013. Identify- pressions for grammar engineering. In EMNLP- ing phrasal verbs using many bilingual corpora. In CoNLL, pages 1034–1043. Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing (EMNLP Aline Villavicencio. 2005. The availability of verb– 2013), Seattle, WA, October. particle constructions in lexical resources: How much is enough? Computer Speech & Language, Carlos Ramisch, Aline Villavicencio, and Christian 19(4):415–432. Boitet. 2010. mwetoolkit: a Framework for Mul- tiword Expression Identification. In Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC 2010), Valetta, Malta, May. European Language Resources Asso- ciation. Alexandre Rondon, Helena Caseli, and Carlos Ramisch. 2015. Never-ending multiword expres- sions learning. In Proceedings of the 11th Work- shop on Multiword Expressions, pages 45–53, Den- ver, Colorado, June. Association for Computational Linguistics. Pavel Rychlý. 2008. A lexicographer-friendly asso- ciation score. In RASLAN 2008, pages 6–9, Brno. Masarykova Univerzita. Bahar Salehi and Paul Cook. 2013. Predicting the compositionality of multiword expressions us- ing translations in multiple languages. Second Joint Conference on Lexical and Computational Seman- tics (* SEM), 1:266–275. Nathan Schneider, Emily Danchik, Chris Dyer, and Noah A. Smith. 2014a. Discriminative lexical se- mantic segmentation with gaps: Running the MWE gamut. TACL, 2:193–206. Nathan Schneider, Spencer Onuffer, Nora Kazour, Emily Danchik, Michael T. Mordowanec, Henrietta Conrad, and Noah A. Smith. 
4 Conclusions and future work

In this paper, we have outlined our work towards a gold-standard dataset tagged with Italian verb-noun MWEs along with their contexts. We show the reliability of this dataset through its considerable inter-annotator agreement, compared to the moderate inter-annotator agreement on verb-noun expressions annotated without context. We also report the results of the automatic extraction of MWEs using this dataset as a gold standard. One of the advantages of this dataset is that it includes both 0-tagged and 1-tagged tokens of expressions, so it can be used for classification and other statistical NLP approaches. In future work, we are interested in extracting context features from the concordances in this resource in order to automatically recognise and classify the expressions that are MWEs in some contexts but not in others.

References

Timothy Baldwin and Su Nam Kim. 2010. Multiword expressions. In Handbook of Natural Language Processing, second edition, pages 267-292. CRC Press.

Colin Bannard. 2007. A measure of syntactic flexibility for automatically identifying multiword expressions in corpora. In Proceedings of the Workshop on a Broader Perspective on Multiword Expressions, pages 1-8. Association for Computational Linguistics.

Marco Baroni and Adam Kilgarriff. 2006. Large linguistically-processed web corpora for multiple languages. In Proceedings of the Eleventh Conference of the European Chapter of the Association for Computational Linguistics: Demonstrations, EACL '06, pages 87-90, Stroudsburg, PA, USA. Association for Computational Linguistics.

Helena Medeiros de Caseli, Carlos Ramisch, Maria das Graças Volpe Nunes, and Aline Villavicencio. 2010. Alignment-based extraction of multiword expressions. Language Resources and Evaluation, 44(1-2):59-77.

Ted Dunning. 1993. Accurate methods for the statistics of surprise and coincidence. Computational Linguistics, 19(1):61-74.

Stefan Evert. 2008. Corpora and collocations. In Corpus Linguistics: An International Handbook, volume 2, pages 1212-1248.

Afsaneh Fazly, Suzanne Stevenson, and Ryan North. 2007. Automatically learning semantic knowledge about multiword predicates. Language Resources and Evaluation, 41(1):61-89.

Afsaneh Fazly, Paul Cook, and Suzanne Stevenson. 2009. Unsupervised type and token identification of idiomatic expressions. Computational Linguistics, 35(1):61-103.

Sylviane Granger and Fanny Meunier. 2008. Phraseology: An Interdisciplinary Perspective. John Benjamins Publishing Company.

Adam Kilgarriff, Pavel Rychlý, Pavel Smrz, and David Tugwell. 2004. The Sketch Engine. In EURALEX 2004, pages 105-116, Lorient, France.

Brigitte Krenn and Stefan Evert. 2001. Can we do better than frequency? A case study on extracting PP-verb collocations. In Proceedings of the ACL Workshop on Collocations, pages 39-46.

Gyri Smørdal Losnegaard, Federico Sangati, Carla Parra Escartín, Agata Savary, Sascha Bargmann, and Johanna Monti. 2016. PARSEME survey on MWE resources. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), Paris, France. European Language Resources Association (ELRA).

Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. 2008. Introduction to Information Retrieval. Cambridge University Press.

Johanna Monti and Amalia Todirascu. 2015. Multiword units translation evaluation in machine translation: another pain in the neck? In Proceedings of the MUMTTT Workshop, Corpas Pastor G., Monti J., Mitkov R., Seretan V. (eds) (2015), Multi-word Units in Machine Translation and Translation Technology.

Johanna Monti, Ruslan Mitkov, Gloria Corpas Pastor, and Violeta Seretan. 2013. Multi-word units in machine translation and translation technologies.

Michael P. Oakes. 1998. Statistics for Corpus Linguistics. Edinburgh University Press, Edinburgh.

Karl Pichotta and John DeNero. 2013. Identifying phrasal verbs using many bilingual corpora. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing (EMNLP 2013), Seattle, WA, October.

Carlos Ramisch, Aline Villavicencio, and Christian Boitet. 2010. mwetoolkit: a framework for multiword expression identification. In Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC 2010), Valetta, Malta, May. European Language Resources Association.

Alexandre Rondon, Helena Caseli, and Carlos Ramisch. 2015. Never-ending multiword expressions learning. In Proceedings of the 11th Workshop on Multiword Expressions, pages 45-53, Denver, Colorado, June. Association for Computational Linguistics.

Pavel Rychlý. 2008. A lexicographer-friendly association score. In RASLAN 2008, pages 6-9, Brno. Masarykova Univerzita.

Bahar Salehi and Paul Cook. 2013. Predicting the compositionality of multiword expressions using translations in multiple languages. In Second Joint Conference on Lexical and Computational Semantics (*SEM), 1:266-275.

Nathan Schneider, Emily Danchik, Chris Dyer, and Noah A. Smith. 2014a. Discriminative lexical semantic segmentation with gaps: Running the MWE gamut. Transactions of the Association for Computational Linguistics, 2:193-206.

Nathan Schneider, Spencer Onuffer, Nora Kazour, Emily Danchik, Michael T. Mordowanec, Henrietta Conrad, and Noah A. Smith. 2014b. Comprehensive annotation of multiword expressions in a social web corpus. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), pages 455-461, Reykjavik, Iceland. European Language Resources Association (ELRA).

Violeta Seretan and Eric Wehrli. 2013. Syntactic concordancing and multi-word expression detection. International Journal of Data Mining, Modelling and Management, 5(2):158-181.

Frank Smadja. 1993. Retrieving collocations from text: Xtract. Computational Linguistics, 19:143-177.

Shiva Taslimipoor, Ruslan Mitkov, Gloria Corpas Pastor, and Afsaneh Fazly. 2016. Bilingual contexts from comparable corpora to mine for translations of collocations. In Proceedings of the 17th International Conference on Intelligent Text Processing and Computational Linguistics, CICLing'16. Springer.

Aline Villavicencio, Valia Kordoni, Yi Zhang, Marco Idiart, and Carlos Ramisch. 2007. Validation and evaluation of automatically acquired multiword expressions for grammar engineering. In EMNLP-CoNLL, pages 1034-1043.

Aline Villavicencio. 2005. The availability of verb-particle constructions in lexical resources: How much is enough? Computer Speech & Language, 19(4):415-432.