<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <article-id pub-id-type="doi">10.4000/books.aaccademia</article-id>
      <title-group>
        <article-title>Identification of Multiword Expressions: comparing the performance of a Conditional Random Fields model on corpora of written and spoken Italian</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ilaria Manfredi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lorenzo Gregori</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Florence</institution>
          ,
          <addr-line>P.zza San Marco 4, 50121 Florence</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
<year>2023</year>
      </pub-date>
      <volume>2769</volume>
      <fpage>10</fpage>
      <lpage>12</lpage>
      <abstract>
        <p>This paper describes an experiment that compares the performance of a Conditional Random Fields model on identification of Multiword expressions in corpora of spoken and written Italian. The model is trained on a corpus of spoken language and a corpus of written language annotated with Multiword expressions, then tested on two other corpora (one written and one spoken). This methodology provides very good results regarding Precision.</p>
      </abstract>
      <kwd-group>
<kwd>Multiword Expressions</kwd>
        <kwd>Conditional Random Fields</kwd>
        <kwd>Spoken corpora</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>"Multiword expression" (MWE) is a term used to refer to groups of words that display formal or functional idiosyncratic properties with respect to free word combinations, and therefore behave like a unit [1]. This notion encompasses a wide set of linguistic phenomena, of both semantic and syntactic nature, like idioms, verb-particle constructions, complex nominals, and support verb constructions. The computational treatment of MWEs notoriously poses a challenge in NLP [2], but in recent years a lot of effort has been put into the development of techniques and tools for the identification of MWEs in corpora. These are almost exclusively derived from, and tested on, written corpora. This leaves the study of MWEs in spoken varieties of languages, including Italian, a rather unexplored field.</p>
      <p>Given the major differences between spoken and written language, we deemed it important to establish how an automatic MWE extraction tool trained on a written corpus performs on a spoken one, also considering the lack of specific resources for spoken corpora. We decided to conduct an experiment training a Conditional Random Fields (CRF) model [3] to identify MWEs. The model was trained on both a corpus of spoken and one of written Italian; the two models obtained were then tested on corpora of spoken and written Italian, and their performances were evaluated. In § 2 we give an overview of existing research on MWEs and related resources for Italian; in § 3 we describe the resources used to build the training and test corpora; in § 4 we describe the methodology followed to annotate the training corpora with MWEs and the testing; results of the experiment are presented in § 5 and discussed in § 6.</p>
    </sec>
    <sec id="sec-1b">
      <title>2. Related work</title>
      <p>Identification of MWEs in corpora is essential for various NLP tasks such as machine translation and parsing, so a lot of research has been done on the automatic acquisition of MWEs, both in general and for specific languages [4]. Many studies have explored the use of Association Measures for MWE identification [5, 6, 7]; methodologies based on parallel corpora have also been investigated [8]. More recently, the use of different AI models has been tested for this task [9, 10]. Among these, CRF models have been used successfully in NLP for various sequence labeling tasks, including MWE identification [11, 12, 13]. Given that, we decided to use one of the CRF models available for our experiment (see § 4). As already mentioned, all of these studies have been conducted on written corpora only, and so are the resources derived from them (mainly MWE-annotated corpora and gold standard lists). As for MWEs in spoken corpora, Strik et al. investigated possible ways of automatically identifying MWEs in Dutch speech corpora based on pronunciation characteristics; Trotta et al. built PoliSdict, a dictionary of Italian MWEs extracted from a corpus of political speech. To the best of our knowledge, this is the only resource of spoken language MWEs existing for Italian. Other resources for Italian MWEs are PARSEME-It, a written corpus annotated with verbal MWEs [16, 17], and a validated dataset of MWEs from written corpora compiled by Masini et al. [19].</p>
      <p>This brief overview highlights the gap in the existing literature regarding MWEs from spoken language; hence, our experiment seeks to evaluate the performance of one of the tools available, up to now tested only on written corpora.</p>
    </sec>
    <sec id="sec-2">
      <title>3. Resources</title>
      <sec id="sec-2-1">
        <title>For the experiment, we have used two training corpora</title>
        <p>and two test corpora (described in § 4.1) derived from
the following resources.</p>
        <p>KIParla [20] is a spoken corpus containing more than
112 hours of speech recorded in various settings from
speakers from different areas of Italy, and is currently
composed of two modules. The KIP module [21] contains
speech of students and professors recorded in the
Universities of Bologna and Turin.</p>
        <p>IMAGACT is a corpus of approximately 1.8 million tokens (tokens are here intended as single graphic units that include punctuation, symbols, and words, as usual in computational linguistics) used for the creation of the IMAGACT Visual Ontology resource [22]; it contains texts of spoken Italian derived from the LABLITA Corpus of Spontaneous Italian, the LIP corpus, and the spoken section of the CLIPS corpus. The materials contained are heterogeneous from a diaphasic, diastratic, and diatopic point of view (see Gagliardi for a detailed description).</p>
        <p>CorDIC-scritto is a web corpus created within the RIDIRE project [24] containing written texts pertaining to five different semantic and functional domains: creative, bureaucratic, news, arts, economy (see http://cordic.lablita.it/).</p>
        <p>PAISÀ [25] is a web corpus of approximately 250 million tokens containing documents from web pages. Part of the documents was obtained by retrieving pages using pairs of words from the Italian basic vocabulary list as queries; others were derived from the Italian versions of various Wikimedia Foundation projects.</p>
    </sec>
    <sec id="sec-3">
      <title>4. Methodology</title>
      <sec id="sec-3-1">
        <title>This work has been conducted making use of the</title>
        <p>mwetoolkit software [26] for the extracting, filtering
and annotating of the MWEs; the CRF model we have
used is the one implemented in the CRFsuite software
[27] and provided within the toolkit.</p>
        <sec id="sec-3-1-1">
          <title>4.1. Training and test corpora</title>
          <p>We have used the KIP module of KIParla as the spoken training corpus (compared to the original resource, available at https://kiparla.it/search/, our corpus lacks the documents BOC1006, BOD2008, TOA3005, TOD1005bis) and CorDIC-scritto as the written training corpus. As the spoken test corpus we have used IMAGACT. Lastly, for the written test corpus we have sampled PAISÀ to obtain approximately the same number of tokens as IMAGACT. Table 1 reports the role and size of each corpus: KIP (spoken training, 559,816 words), CorDIC (written training, 502,665 words), IMAGACT (spoken test), and the PAISÀ sample (written test).</p>
        </sec>
        <sec id="sec-3-2-1">
          <title>4.2. Annotation of the training corpora</title>
          <p>The first step to annotate the training corpora was the extraction of candidates, obtained by searching the corpora with sets of POS-patterns (see Ramisch and Lenci et al. for an assessment of the method). The chosen POS-patterns were derived from the work of Masini et al. [19], who provided a dataset of 1682 validated Italian MWEs extracted from written corpora with the POS-pattern method. We chose to use the top 20 POS-patterns in the dataset ranked by number of MWEs. Since the patterns in the dataset are provided according to the ISST-Tanl tagset (http://www.italianlp.it/docs/ISST-TANL-POStagset.pdf), we first "translated" the tags to their respective ones in Baroni's tagset (https://home.sslmit.unibo.it/~baroni/collocazioni/itwac.tagset.txt). The tagsets are not symmetrical (for example, the ISST-Tanl tags RD 'determinative article' and RI 'indeterminative article' are both ART 'article' in Baroni's tagset), so we recomputed the frequency of MWEs for each pattern and then took the top 20. The 20 POS-patterns used are bigrams and trigrams of adjectival, nominal, verbal, adverbial, and prepositional patterns.</p>
          <p>Using mwetoolkit functions, the corpora were searched, and for every POS-pattern a list of candidates was obtained; each corpus was searched independently and the lists of candidates were examined separately. As a second step, all the lists of candidates were filtered by number of occurrences: only candidates with a frequency of 4 or more were kept. Lists containing a high number of candidates were further filtered before being manually examined: for KIP, lists having more than 150 candidates were ranked by LogLikelihood and the top 100 were examined; for CorDIC, lists with more than 100 candidates were ranked by LogLikelihood (to calculate LogLikelihood for trigrams we have used the Ngram Statistics Package [30, 31]) and the top 100 were examined. In lists having fewer candidates than that, all of the candidates were examined. This way there is approximately the same number of candidates to be examined for each corpus: 1496 for KIP and 1584 for CorDIC.</p>
          <p>Table 2 shows, for each POS-pattern, the number of candidates with frequency &gt; 3 in KIP (candK) and CorDIC (candC) and the number of candidates examined in each corpus (anK and anC). POS tags are abbreviated as follows: A = adjective, N = noun, Pre-Art = articulated preposition, Pre = preposition, V = verb, Art = article, DInd = indefinite determiner, Adv = adverb.</p>
          <p>As the final step, the remaining candidates from all the lists were manually examined. Candidates that showed some type of idiomaticity or fixedness, or were characterized by high familiarity of use, were annotated as MWEs: in total, 214 MWEs for KIP and 204 for CorDIC. MWEs were tagged in their respective corpora using the IOB format [32]. In this process, attention has been paid to tagging MWEs only when they occur in an idiomatic context, and not where they have a literal meaning.</p>
        </sec>
        <sec id="sec-3-2-2">
          <title>4.3. Training and testing</title>
        </sec>
      </sec>
      <sec id="sec-3-3">
        <title>The model was trained on MWE annotated KIP</title>
        <p>and CorDIC independently, using the functions of
mwetoolkit; the training script was not modified and
the features were kept as provided7.</p>
          <p>So we obtained two models, one trained on KIP (the 'spoken model') and one trained on CorDIC (the 'written model'). We used each of them to identify MWEs in IMAGACT and PAISÀ, with the aim of comparing the results and determining whether the best performance on a spoken corpus comes from a spoken or a written model, and vice versa.</p>
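          <p>mwetoolkit drives CRFsuite internally; as an illustration, the sketch below trains and applies a CRF tagger directly with the python-crfsuite bindings on IOB-labelled tokens (B marks the first token of a MWE, I its continuation, O everything else). The feature template is a simplified stand-in for the default feature list shipped with the toolkit, and the one-sentence training set is purely illustrative.</p>
          <preformat>
import pycrfsuite

def token_features(sent, i):
    """Features for token i: surface form, POS tag, and neighbouring POS tags."""
    word, pos = sent[i]
    return {
        "w": word.lower(),
        "pos": pos,
        "-1:pos": sent[i - 1][1] if i else "BOS",
        "+1:pos": "EOS" if i + 1 == len(sent) else sent[i + 1][1],
    }

# One IOB-annotated sentence: "ad esempio" tagged as a MWE.
sent = [("ad", "PRE"), ("esempio", "NOUN"), ("funziona", "VER")]
labels = ["B", "I", "O"]

trainer = pycrfsuite.Trainer(verbose=False)
trainer.append([token_features(sent, i) for i in range(len(sent))], labels)
trainer.train("mwe-crf.model")

# Apply the trained model to (here, the same) tokenized and POS-tagged text.
tagger = pycrfsuite.Tagger()
tagger.open("mwe-crf.model")
print(tagger.tag([token_features(sent, i) for i in range(len(sent))]))
          </preformat>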
      </sec>
    </sec>
    <sec id="sec-4">
      <title>5. Results</title>
      <p>The spoken model tagged 7508 occurrences of MWEs in IMAGACT and 3337 in PAISÀ; the written model tagged 6291 occurrences in IMAGACT and 5047 in PAISÀ. For a full evaluation of the models we need to compute Precision and Recall on the annotated corpora. The computation of Recall requires all the false negatives in the test corpora to be identified; for that, we would need to manually annotate the entire corpora, a very time-consuming task that requires multiple trained annotators. Another element of complexity for this task is providing annotators with a precise definition of what to consider a MWE, as the distinction between MWEs and other types of word combinations is not always clear-cut.</p>
      <p>Therefore, evaluation has been performed by manually computing Precision on a sample of 500 MWEs from each batch of results. Table 3 shows the occurrences of MWEs and Precision at 500 for the spoken and written models on each corpus.</p>
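      <p>Computationally, Precision at 500 reduces to judging a random sample of the tagged occurrences; in the sketch below, judge is a hypothetical placeholder for the manual true/false decision of the annotator.</p>
      <preformat>
import random

def precision_at_k(tagged, judge, k=500, seed=0):
    """Estimate Precision from a random sample of k tagged MWE occurrences;
    judge(occurrence) returns True when the occurrence is a correctly
    tagged MWE."""
    sample = random.Random(seed).sample(tagged, min(k, len(tagged)))
    return sum(1 for occ in sample if judge(occ)) / len(sample)

# E.g., for the 7508 occurrences tagged by the spoken model on IMAGACT:
# precision_at_k(spoken_model_output, human_judgement)
      </preformat>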
      <sec id="sec-4-1">
        <title>Results obtained show a great performance overall for</title>
        <p>both of the models, given the high value for Precision
for all four of the corpora tagged. However,
considering also the number of MWE occurrences tagged, we</p>
      </sec>
      <sec id="sec-4-2">
        <title>6To calculate LogLikelihood for trigrams we have used the Ngram</title>
        <p>Statistics Package [30, 31]</p>
      </sec>
      <sec id="sec-4-3">
        <title>7See https://gitlab.com/mwetoolkit/mwetoolkit3/</title>
        <p>/blob/master/resources/default-config/listFeatures.txt
can see that the spoken model performed the worst on
PAISÀ, having the lowest Precision and number of
occurrences, while better results are achieved on the same
corpus by the written model. On IMAGACT, both of
the models performed very well, with the written model
having the best Precision overall but slightly fewer
occurrences of MWEs found. We have also counted the
number of MWEs tagged (per lemmas) in IMAGACT, and
how many of these were "new" compared to the ones
annotated in the training corpora. The spoken model
tagged 222 MWEs (per lemmas) of which 63 were new
(28.4%) and the written model tagged 224 MWEs (per
lemmas), 64 being new (28.6%), so the models performed
similarly in this regard too. A slight diference in
performance can be noted comparing Precision in tagging new
MWEs: new MWEs found by spoken model account for
a total of 119 occurrences, 46 of which results correctly
tagged; new MWEs found by written model account for
123 occurrences, 60 of which are correctly tagged.</p>
        <p>In conclusion, the results of this experiment show that
on spoken corpora ’written models’ perform similarly to
’spoken models’; this looks really promising,
considering the lack of resources dedicated to MWEs in spoken
language. Future works in this line of research include
the computing of Recall for the models and qualitative
evaluation of the MWEs extracted.
S. Castagnoli, F. Dell’Orletta, H. Dittmann, A. Lenci,
V. Pirrelli, The PAISÀ corpus of Italian web texts,
in: Proceedings of the 9th Web as Corpus
Workshop (WaC-9), Association for Computational
Linguistics, Gothenburg, Sweden, 2014, pp. 36–43.
doi:10.3115/v1/W14-0406.
[26] C. Ramisch, A. Villavicencio, C. Boitet, mwetoolkit:
a framework for multiword expression
identification, in: Proceedings of the Seventh International
Conference on Language Resources and Evaluation
(LREC’10), European Language Resources
Association (ELRA), Valletta, Malta, 2010.
[27] N. Okazaki, Crfsuite: a fast implementation of
conditional random fields (crfs), 2007. URL: http:
//www.chokkan.org/software/crfsuite/.
[28] H. Schmid, Probabilistic part-of-speech tagging
using decision trees, in: Proceedings of the
International Conference on New Methods in Language
Processing, 1994.
[29] A. Lenci, F. Masini, M. Nissim, S. Castagnoli,
G. Lebani, L. Passaro, M. Senaldi, How to harvest
word combinations from corpora: Methods,
evaluation and perspectives, Studi e saggi linguistici 55
(2017) 45–68.
[30] S. Banerjee, T. Pedersen, The design,
implementation, and use of the ngram statistics package, in:
Computational Linguistics and Intelligent Text
Processing, volume 2000, 2003, pp. 370–381. doi:10.
1007/3-540-36456-0_38.
[31] T. Pedersen, S. Banerjee, B. McInnes, S. Kohli,
M. Joshi, Y. Liu, The ngram statistics package
(text::NSP) : A flexible tool for identifying ngrams,
collocations, and word associations, in:
Proceedings of the Workshop on Multiword Expressions:
from Parsing and Generation to the Real World,
Association for Computational Linguistics, Portland,
Oregon, 2011, pp. 131–133.
[32] L. Ramshaw, M. Marcus, Text chunking using
transformation-based learning, in: Third
Workshop on Very Large Corpora, 1995.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <title>References</title>
      <ref id="ref25"><mixed-citation>[25] V. Lyding, E. Stemle, C. Borghetti, M. Brunello, S. Castagnoli, F. Dell'Orletta, H. Dittmann, A. Lenci, V. Pirrelli, The PAISÀ corpus of Italian web texts, in: Proceedings of the 9th Web as Corpus Workshop (WaC-9), Association for Computational Linguistics, Gothenburg, Sweden, 2014, pp. 36–43. doi:10.3115/v1/W14-0406.</mixed-citation></ref>
      <ref id="ref26"><mixed-citation>[26] C. Ramisch, A. Villavicencio, C. Boitet, mwetoolkit: a framework for multiword expression identification, in: Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10), European Language Resources Association (ELRA), Valletta, Malta, 2010.</mixed-citation></ref>
      <ref id="ref27"><mixed-citation>[27] N. Okazaki, CRFsuite: a fast implementation of conditional random fields (CRFs), 2007. URL: http://www.chokkan.org/software/crfsuite/.</mixed-citation></ref>
      <ref id="ref28"><mixed-citation>[28] H. Schmid, Probabilistic part-of-speech tagging using decision trees, in: Proceedings of the International Conference on New Methods in Language Processing, 1994.</mixed-citation></ref>
      <ref id="ref29"><mixed-citation>[29] A. Lenci, F. Masini, M. Nissim, S. Castagnoli, G. Lebani, L. Passaro, M. Senaldi, How to harvest word combinations from corpora: Methods, evaluation and perspectives, Studi e saggi linguistici 55 (2017) 45–68.</mixed-citation></ref>
      <ref id="ref30"><mixed-citation>[30] S. Banerjee, T. Pedersen, The design, implementation, and use of the Ngram Statistics Package, in: Computational Linguistics and Intelligent Text Processing, 2003, pp. 370–381. doi:10.1007/3-540-36456-0_38.</mixed-citation></ref>
      <ref id="ref31"><mixed-citation>[31] T. Pedersen, S. Banerjee, B. McInnes, S. Kohli, M. Joshi, Y. Liu, The Ngram Statistics Package (Text::NSP): a flexible tool for identifying ngrams, collocations, and word associations, in: Proceedings of the Workshop on Multiword Expressions: from Parsing and Generation to the Real World, Association for Computational Linguistics, Portland, Oregon, 2011, pp. 131–133.</mixed-citation></ref>
      <ref id="ref32"><mixed-citation>[32] L. Ramshaw, M. Marcus, Text chunking using transformation-based learning, in: Third Workshop on Very Large Corpora, 1995.</mixed-citation></ref>
    </ref-list>
  </back>
</article>