<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <article-id pub-id-type="doi">10.4000/books.aaccademia</article-id>
      <title-group>
        <article-title>Identification of Multiword Expressions: comparing the performance of a Conditional Random Fields model on corpora of written and spoken Italian</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ilaria Manfredi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lorenzo Gregori</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Florence</institution>
          ,
          <addr-line>P.zza San Marco 4, 50121 Florence</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
<year>2023</year>
      </pub-date>
      <volume>2769</volume>
      <fpage>10</fpage>
      <lpage>12</lpage>
      <abstract>
        <p>This paper describes an experiment that compares the performance of a Conditional Random Fields model on identification of Multiword expressions in corpora of spoken and written Italian. The model is trained on a corpus of spoken language and a corpus of written language annotated with Multiword expressions, then tested on two other corpora (one written and one spoken). This methodology provides very good results regarding Precision.</p>
      </abstract>
      <kwd-group>
<kwd>Multiword Expressions</kwd>
        <kwd>Conditional Random Fields</kwd>
        <kwd>Spoken corpora</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>"Multiword expression" (MWE) is a term used to refer to groups of words that display formal or functional idiosyncratic properties with respect to free word combinations, and therefore behave like a unit [1]. This notion encompasses a wide set of linguistic phenomena, of both semantic and syntactic nature, like idioms, verb-particle constructions, complex nominals, and support verb constructions. The computational treatment of MWEs notoriously poses a challenge in NLP [2], but in recent years a lot of effort has been put into the development of techniques and tools for the identification of MWEs in corpora. These are almost exclusively derived from, and tested on, written corpora. This leaves the study of MWEs in spoken varieties of languages, including Italian, a rather unexplored field.</p>
      <p>Given the major differences between spoken and written language, we deemed it important to establish how an automatic MWE extraction tool trained on a written corpus performs on a spoken one, also considering the lack of specific resources for spoken corpora. We decided to conduct an experiment training a Conditional Random Fields (CRF) model [3] to identify MWEs. The model was trained on both a corpus of spoken and one of written Italian; the two models obtained were then tested on corpora of spoken and written Italian, and their performances were evaluated. In § 2 we give an overview of existing research on MWEs and related resources for Italian; in § 3 we describe the resources used to build the training and test corpora; in § 4 we describe the methodology followed to annotate the training corpora with MWEs and the testing; results of the experiment are presented in § 5 and discussed in § 6.</p>
    </sec>
    <sec id="sec-1b">
      <title>2. Related work</title>
      <p>Identification of MWEs in corpora is essential for various NLP tasks such as machine translation and parsing, so a lot of research has been done on the automatic acquisition of MWEs, both in general and for specific languages [4]. Many studies have explored the use of Association Measures for MWE identification [5, 6, 7]; methodologies based on parallel corpora have also been investigated [8]. More recently, the use of different AI models has been tested for this task [9, 10]. Among these, CRF models have been used successfully in NLP for various sequence labeling tasks, including MWE identification [11, 12, 13]. Given that, we decided to use one of the CRF models available for our experiment (see § 4). As already mentioned, all of these studies have been conducted on written corpora only, and so are the resources derived from them (mainly MWE-annotated corpora and gold standard lists). As for MWEs in spoken corpora, Strik et al. investigated possible ways of automatically identifying MWEs in Dutch speech corpora based on pronunciation characteristics; Trotta et al. built PoliSdict, a dictionary of Italian MWEs extracted from a corpus of political speech. To the best of our knowledge, this is the only resource of spoken language MWEs existing for Italian. Other resources for Italian MWEs are PARSEME-It, a written corpus annotated with verbal MWEs [16, 17], and a validated dataset of MWEs from written corpora compiled by Masini et al. [19].</p>
      <p>This brief overview highlights the gap in the existing literature regarding MWEs from spoken language; hence, our experiment seeks to evaluate the performance of one of the tools available, up to now tested only on written corpora.</p>
    </sec>
    <sec id="sec-2">
      <title>3. Resources</title>
      <sec id="sec-2-1">
        <title>For the experiment, we have used two training corpora</title>
        <p>and two test corpora (described in § 4.1) derived from
the following resources.</p>
        <p>KIParla [20] is a spoken corpus containing more than
112 hours of speech recorded in various settings from
speakers from different areas of Italy, and is currently
composed of two modules. The KIP module [21] contains
speech of students and professors recorded in the
Universities of Bologna and Turin.</p>
        <p>IMAGACT is a corpus of approximately 1.8 million tokens (tokens are here intended as single graphic units that include punctuation, symbols, and words, as usual in computational linguistics) used for the creation of the IMAGACT Visual Ontology resource [22]; it contains texts of spoken Italian derived from the LABLITA Corpus of Spontaneous Italian, the LIP corpus, and the spoken section of the CLIPS corpus. The materials contained are heterogeneous from a diaphasic, diastratic, and diatopic point of view (see Gagliardi for a detailed description).</p>
        <p>CorDIC-scritto is a web corpus created within the RIDIRE project [24] containing written texts pertaining to five different semantic and functional domains: creative, bureaucratic, news, arts, economy (see http://cordic.lablita.it/).</p>
        <p>PAISÀ [25] is a web corpus of approximately 250 million tokens containing documents from web pages. Part of the documents was obtained by retrieving pages using pairs of words from the Italian basic vocabulary list as queries; others were derived from the Italian versions of various Wikimedia Foundation projects.</p>
    </sec>
    <sec id="sec-3">
      <title>4. Methodology</title>
      <sec id="sec-3-1">
        <title>This work has been conducted making use of the</title>
        <p>mwetoolkit software [26] for the extracting, filtering
and annotating of the MWEs; the CRF model we have
used is the one implemented in the CRFsuite software
[27] and provided within the toolkit.</p>
        <sec id="sec-3-1-1">
          <title>4.1. Training and test corpora</title>
          <p>We have used the KIP module of KIParla as the spoken training corpus (compared to the original resource, available at https://kiparla.it/search/, our corpus lacks the documents BOC1006, BOD2008, TOA3005, TOD1005bis) and CorDIC-scritto as the written training corpus. As the spoken test corpus we have used IMAGACT. Lastly, for the written test corpus we have sampled PAISÀ to obtain approximately the same number of tokens as IMAGACT. Table 1 reports the role and size of each corpus: KIP (spoken training, 559,816 words), CorDIC (written training, 502,665 words), IMAGACT (spoken test), and the PAISÀ sample (written test).</p>
        </sec>
        <sec id="sec-3-2-1">
          <title>4.2. Annotation of the training corpora</title>
          <p>The first step to annotate the training corpora was the extraction of candidates, obtained by searching the corpora with sets of POS-patterns (see Ramisch and Lenci et al. for an assessment of the method). The chosen POS-patterns were derived from the work of Masini et al. [19], who provided a dataset of 1682 validated Italian MWEs extracted from written corpora with the POS-pattern method. We chose to use the top 20 POS-patterns in the dataset ranked by number of MWEs. Since the patterns in the dataset are provided according to the ISST-Tanl tagset (http://www.italianlp.it/docs/ISST-TANL-POStagset.pdf), we first "translated" the tags to their respective ones in Baroni's tagset (https://home.sslmit.unibo.it/~baroni/collocazioni/itwac.tagset.txt). The tagsets are not symmetrical (for example, the ISST-Tanl tags RD 'determinative article' and RI 'indeterminative article' are both ART 'article' in Baroni's tagset), so we recomputed the frequency of MWEs for each pattern and then took the top 20. The 20 POS-patterns used are bigrams and trigrams of adjectival, nominal, verbal, adverbial, and prepositional patterns.</p>
          <p>Using mwetoolkit functions, the corpora were searched, and for every POS-pattern a list of candidates was obtained; each corpus was searched independently and the lists of candidates were examined separately. As a second step, all the lists of candidates were filtered by number of occurrences: only candidates with a frequency of 4 or more were kept. Lists containing a high number of candidates were further filtered before being manually examined: for KIP, lists having more than 150 candidates were ranked by LogLikelihood and the top 100 were examined; for CorDIC, lists with more than 100 candidates were ranked by LogLikelihood (to calculate LogLikelihood for trigrams we have used the Ngram Statistics Package [30, 31]) and the top 100 were examined. In lists having fewer candidates than that, all of the candidates were examined. This way there is approximately the same number of candidates to be examined for each corpus: 1496 for KIP and 1584 for CorDIC.</p>
          <p>Table 2 shows, for each POS-pattern, the number of candidates with frequency &gt; 3 in KIP (candK) and CorDIC (candC) and the number of candidates examined in each corpus (anK and anC). POS tags are abbreviated as follows: A = adjective, N = noun, Pre-Art = articulated preposition, Pre = preposition, V = verb, Art = article, DInd = indefinite determiner, Adv = adverb.</p>
          <p>As the final step, the remaining candidates from all the lists were manually examined. Candidates that showed some type of idiomaticity or fixedness, or were characterized by high familiarity of use, were annotated as MWEs: in total, 214 MWEs for KIP and 204 for CorDIC. MWEs were tagged in their respective corpora using the IOB format [32]. In this process, attention has been paid to tagging MWEs only when they occur in an idiomatic context, and not where they have a literal meaning.</p>
        </sec>
        <sec id="sec-3-2-2">
          <title>4.3. Training and testing</title>
        </sec>
      </sec>
      <sec id="sec-3-3">
        <title>The model was trained on MWE annotated KIP</title>
        <p>and CorDIC independently, using the functions of
mwetoolkit; the training script was not modified and
the features were kept as provided7.</p>
          <p>So we obtained two models, one trained on KIP (the 'spoken model') and one trained on CorDIC (the 'written model'). We used each of them to identify MWEs in IMAGACT and PAISÀ, with the aim of comparing the results and determining whether the best performance on a spoken corpus comes from a spoken or a written model, and vice versa.</p>
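          <p>mwetoolkit drives CRFsuite internally; as an illustration, the sketch below trains and applies a CRF tagger directly with the python-crfsuite bindings on IOB-labelled tokens (B marks the first token of a MWE, I its continuation, O everything else). The feature template is a simplified stand-in for the default feature list shipped with the toolkit, and the one-sentence training set is purely illustrative.</p>
          <preformat>
import pycrfsuite

def token_features(sent, i):
    """Features for token i: surface form, POS tag, and neighbouring POS tags."""
    word, pos = sent[i]
    return {
        "w": word.lower(),
        "pos": pos,
        "-1:pos": sent[i - 1][1] if i else "BOS",
        "+1:pos": "EOS" if i + 1 == len(sent) else sent[i + 1][1],
    }

# One IOB-annotated sentence: "ad esempio" tagged as a MWE.
sent = [("ad", "PRE"), ("esempio", "NOUN"), ("funziona", "VER")]
labels = ["B", "I", "O"]

trainer = pycrfsuite.Trainer(verbose=False)
trainer.append([token_features(sent, i) for i in range(len(sent))], labels)
trainer.train("mwe-crf.model")

# Apply the trained model to (here, the same) tokenized and POS-tagged text.
tagger = pycrfsuite.Tagger()
tagger.open("mwe-crf.model")
print(tagger.tag([token_features(sent, i) for i in range(len(sent))]))
          </preformat>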
      </sec>
    </sec>
    <sec id="sec-4">
      <title>5. Results</title>
      <p>The spoken model tagged 7508 occurrences of MWEs in IMAGACT and 3337 in PAISÀ; the written model tagged 6291 occurrences in IMAGACT and 5047 in PAISÀ. For a full evaluation of the models we need to compute Precision and Recall on the annotated corpora. The computation of Recall requires all the false negatives in the test corpora to be identified; for that, we would need to manually annotate the entire corpora, a very time-consuming task that requires multiple trained annotators. Another element of complexity for this task is providing annotators with a precise definition of what to consider a MWE, as the distinction between MWEs and other types of word combinations is not always clear-cut.</p>
      <p>Therefore, evaluation has been performed by manually computing Precision on a sample of 500 MWEs from each batch of results. Table 3 shows the occurrences of MWEs and Precision at 500 for the spoken and written models on each corpus.</p>
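      <p>Computationally, Precision at 500 reduces to judging a random sample of the tagged occurrences; in the sketch below, judge is a hypothetical placeholder for the manual true/false decision of the annotator.</p>
      <preformat>
import random

def precision_at_k(tagged, judge, k=500, seed=0):
    """Estimate Precision from a random sample of k tagged MWE occurrences;
    judge(occurrence) returns True when the occurrence is a correctly
    tagged MWE."""
    sample = random.Random(seed).sample(tagged, min(k, len(tagged)))
    return sum(1 for occ in sample if judge(occ)) / len(sample)

# E.g., for the 7508 occurrences tagged by the spoken model on IMAGACT:
# precision_at_k(spoken_model_output, human_judgement)
      </preformat>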
      <sec id="sec-4-1">
        <title>Results obtained show a great performance overall for</title>
        <p>both of the models, given the high value for Precision
for all four of the corpora tagged. However,
considering also the number of MWE occurrences tagged, we</p>
      </sec>
      <sec id="sec-4-2">
        <title>6To calculate LogLikelihood for trigrams we have used the Ngram</title>
        <p>Statistics Package [30, 31]</p>
      </sec>
      <sec id="sec-4-3">
        <title>7See https://gitlab.com/mwetoolkit/mwetoolkit3/</title>
        <p>/blob/master/resources/default-config/listFeatures.txt
can see that the spoken model performed the worst on
PAISÀ, having the lowest Precision and number of
occurrences, while better results are achieved on the same
corpus by the written model. On IMAGACT, both of
the models performed very well, with the written model
having the best Precision overall but slightly fewer
occurrences of MWEs found. We have also counted the
number of MWEs tagged (per lemmas) in IMAGACT, and
how many of these were "new" compared to the ones
annotated in the training corpora. The spoken model
tagged 222 MWEs (per lemmas) of which 63 were new
(28.4%) and the written model tagged 224 MWEs (per
lemmas), 64 being new (28.6%), so the models performed
similarly in this regard too. A slight diference in
performance can be noted comparing Precision in tagging new
MWEs: new MWEs found by spoken model account for
a total of 119 occurrences, 46 of which results correctly
tagged; new MWEs found by written model account for
123 occurrences, 60 of which are correctly tagged.</p>
        <p>In conclusion, the results of this experiment show that
on spoken corpora ’written models’ perform similarly to
’spoken models’; this looks really promising,
considering the lack of resources dedicated to MWEs in spoken
language. Future works in this line of research include
the computing of Recall for the models and qualitative
evaluation of the MWEs extracted.
S. Castagnoli, F. Dell’Orletta, H. Dittmann, A. Lenci,
V. Pirrelli, The PAISÀ corpus of Italian web texts,
in: Proceedings of the 9th Web as Corpus
Workshop (WaC-9), Association for Computational
Linguistics, Gothenburg, Sweden, 2014, pp. 36–43.
doi:10.3115/v1/W14-0406.
[26] C. Ramisch, A. Villavicencio, C. Boitet, mwetoolkit:
a framework for multiword expression
identification, in: Proceedings of the Seventh International
Conference on Language Resources and Evaluation
(LREC’10), European Language Resources
Association (ELRA), Valletta, Malta, 2010.
[27] N. Okazaki, Crfsuite: a fast implementation of
conditional random fields (crfs), 2007. URL: http:
//www.chokkan.org/software/crfsuite/.
[28] H. Schmid, Probabilistic part-of-speech tagging
using decision trees, in: Proceedings of the
International Conference on New Methods in Language
Processing, 1994.
[29] A. Lenci, F. Masini, M. Nissim, S. Castagnoli,
G. Lebani, L. Passaro, M. Senaldi, How to harvest
word combinations from corpora: Methods,
evaluation and perspectives, Studi e saggi linguistici 55
(2017) 45–68.
[30] S. Banerjee, T. Pedersen, The design,
implementation, and use of the ngram statistics package, in:
Computational Linguistics and Intelligent Text
Processing, volume 2000, 2003, pp. 370–381. doi:10.
1007/3-540-36456-0_38.
[31] T. Pedersen, S. Banerjee, B. McInnes, S. Kohli,
M. Joshi, Y. Liu, The ngram statistics package
(text::NSP) : A flexible tool for identifying ngrams,
collocations, and word associations, in:
Proceedings of the Workshop on Multiword Expressions:
from Parsing and Generation to the Real World,
Association for Computational Linguistics, Portland,
Oregon, 2011, pp. 131–133.
[32] L. Ramshaw, M. Marcus, Text chunking using
transformation-based learning, in: Third
Workshop on Very Large Corpora, 1995.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <title>References</title>
      <ref id="ref25"><mixed-citation>[25] V. Lyding, E. Stemle, C. Borghetti, M. Brunello, S. Castagnoli, F. Dell'Orletta, H. Dittmann, A. Lenci, V. Pirrelli, The PAISÀ corpus of Italian web texts, in: Proceedings of the 9th Web as Corpus Workshop (WaC-9), Association for Computational Linguistics, Gothenburg, Sweden, 2014, pp. 36–43. doi:10.3115/v1/W14-0406.</mixed-citation></ref>
      <ref id="ref26"><mixed-citation>[26] C. Ramisch, A. Villavicencio, C. Boitet, mwetoolkit: a framework for multiword expression identification, in: Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10), European Language Resources Association (ELRA), Valletta, Malta, 2010.</mixed-citation></ref>
      <ref id="ref27"><mixed-citation>[27] N. Okazaki, CRFsuite: a fast implementation of conditional random fields (CRFs), 2007. URL: http://www.chokkan.org/software/crfsuite/.</mixed-citation></ref>
      <ref id="ref28"><mixed-citation>[28] H. Schmid, Probabilistic part-of-speech tagging using decision trees, in: Proceedings of the International Conference on New Methods in Language Processing, 1994.</mixed-citation></ref>
      <ref id="ref29"><mixed-citation>[29] A. Lenci, F. Masini, M. Nissim, S. Castagnoli, G. Lebani, L. Passaro, M. Senaldi, How to harvest word combinations from corpora: Methods, evaluation and perspectives, Studi e saggi linguistici 55 (2017) 45–68.</mixed-citation></ref>
      <ref id="ref30"><mixed-citation>[30] S. Banerjee, T. Pedersen, The design, implementation, and use of the Ngram Statistics Package, in: Computational Linguistics and Intelligent Text Processing, 2003, pp. 370–381. doi:10.1007/3-540-36456-0_38.</mixed-citation></ref>
      <ref id="ref31"><mixed-citation>[31] T. Pedersen, S. Banerjee, B. McInnes, S. Kohli, M. Joshi, Y. Liu, The Ngram Statistics Package (Text::NSP): a flexible tool for identifying ngrams, collocations, and word associations, in: Proceedings of the Workshop on Multiword Expressions: from Parsing and Generation to the Real World, Association for Computational Linguistics, Portland, Oregon, 2011, pp. 131–133.</mixed-citation></ref>
      <ref id="ref32"><mixed-citation>[32] L. Ramshaw, M. Marcus, Text chunking using transformation-based learning, in: Third Workshop on Very Large Corpora, 1995.</mixed-citation></ref>
    </ref-list>
  </back>
</article>