<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Multiword expressions we live by: a validated usage-based dataset from corpora of written Italian</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Francesca Masini</string-name>
          <email>francesca.masini@unibo.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>M. Silvia Micheli</string-name>
          <email>maria.micheli@unimib.it</email>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andrea Zaninello</string-name>
          <email>azaninello@zanichelli.it</email>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sara Castagnoli</string-name>
          <email>sara.castagnoli@unimc.it</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Malvina Nissim</string-name>
          <email>m.nissim@rug.nl</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Alma Mater Studiorum - University of Bologna</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>CLCG, University of Groningen</institution>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>University of Macerata</institution>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>University of Milano-Bicocca</institution>
        </aff>
        <aff id="aff4">
          <label>4</label>
          <institution>Zanichelli editore</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>The paper describes the creation of a manually validated dataset of Italian multiword expressions, building on candidates automatically extracted from corpora of written Italian. The main features of the resource, such as POS-pattern and lemma distribution, are also discussed, together with possible applications.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        The computational treatment of multiword
expressions (henceforth, MWEs) is notoriously a major
challenge in NLP
        <xref ref-type="bibr" rid="ref18">(Ramisch, 2015; Villavicencio
et al., 2005)</xref>
        . In recent decades, the
(computational) linguistics community has devoted considerable
effort to the development of techniques for the
(semi-)automatic identification and extraction of
MWEs from corpora and the consequent creation
of resources, such as gold standard lists of MWEs,
which are needed for evaluation tasks or machine
learning training. This notwithstanding, the
availability of such resources is still quite limited
compared with “the ubiquitous and pervasive nature
of MWEs” (Ramisch, 2015), especially for
‘non-mainstream’ languages like Italian.
      </p>
      <p>
        With this work, we contribute to this line of
research by providing a dataset of 1,682 validated
Italian multiword expressions, obtained through
the manual annotation of candidates automatically
extracted from corpora of written Italian within the
CombiNet project
        <xref ref-type="bibr" rid="ref16 ref17">(Simone and Piunno, 2017b)</xref>
        .
The dataset is intended as a first release that
will be enriched in the future. We describe our
methodology in Section 2, while in Section 3 we
report on preliminary analyses carried out with
respect to MWE features and distribution.
      </p>
      <p>Copyright © 2020 for this paper by its authors. Use
permitted under Creative Commons License Attribution 4.0
International (CC BY 4.0).</p>
    </sec>
    <sec id="sec-2">
      <title>Methodology</title>
      <p>
        For the creation of the dataset we built on data
extracted within the CombiNet project, where the
computational task of extracting candidate word
combinations from corpora was aimed at
supporting the creation of an online lexicographic
resource for Italian
        <xref ref-type="bibr" rid="ref16 ref17">(Simone and Piunno, 2017a)</xref>
        .
The notion of ‘word combination’ was broad
enough to encompass both MWEs
        <xref ref-type="bibr" rid="ref1 ref15 ref2 ref6">(Calzolari et
al., 2002; Sag et al., 2002; Gries, 2008;
Baldwin and Kim, 2010)</xref>
        – namely strings endowed
with (different degrees of) fixedness, idiomaticity
or simply conventionality – and more abstract
distributional properties of a word, such as argument
structures, subcategorization frames or selectional
preferences
        <xref ref-type="bibr" rid="ref8">(Lenci et al., 2017)</xref>
        .
      </p>
      <p>As a consequence, two different extraction
methods – both based on the technique of
searching corpora<sup>1</sup> with sets of patterns, and ranking
retrieved candidates using frequency and association
measures – were used.<sup>2</sup> More precisely, the search
was performed using, in turn, shallow
part-of-speech (POS) sequences and syntactic relations:
the former method performs better with fixed and
adjacent word combinations, whereas the latter is
more efficient for syntactically flexible
combinations. Since for the present work we focus more on
MWEs proper than on combinatorics in
general, we opted to use the data previously gathered
with the POS-based method.</p>
      <p>
        Candidates were obtained by feeding the EXTra
software
        <xref ref-type="bibr" rid="ref13">(Passaro and Lenci, 2015)</xref>
        with a list of
122 POS-patterns deemed representative of Italian
MWEs, derived from both relevant literature and a
corpus-driven identification task; the list includes
adjectival, adverbial, nominal, prepositional and
verbal patterns, up to five slots
        <xref ref-type="bibr" rid="ref8">(see Lenci et al.,
2017)</xref>
        . The results were ranked by LogLikelihood.
      </p>
      <p>
        <sup>1</sup> The corpora used within CombiNet were la Repubblica
        <xref ref-type="bibr" rid="ref3">(Baroni et al., 2004)</xref>
        and PAISÀ
        <xref ref-type="bibr" rid="ref10">(Lyding et al., 2014)</xref>
        .
      </p>
      <p>
        <sup>2</sup> For a full description of the methods and their
assessment, see
        <xref ref-type="bibr" rid="ref8">Lenci et al. (2017)</xref>
        . In what follows we only provide
information which is relevant for the current discussion.
      </p>
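      <p>The ranking step can be illustrated for the simplest, two-word case. The sketch below computes the log-likelihood ratio of a word pair from a 2×2 contingency table of corpus counts, in the usual formulation G² = 2 Σ O ln(O/E). It is only an illustration and not the EXTra implementation: the function name log_likelihood, its parameters and the counts are hypothetical, and the way the measure is extended to patterns of three to five slots is not covered.</p>
      <preformat><![CDATA[
```python
import math


def log_likelihood(pair_freq, w1_freq, w2_freq, total):
    """Log-likelihood ratio (G2) for a two-word candidate.

    pair_freq: occurrences of the pair "w1 w2"
    w1_freq:   occurrences of w1 in first position
    w2_freq:   occurrences of w2 in second position
    total:     total number of two-word sequences considered
    """
    # Observed 2x2 contingency table.
    observed = [
        [pair_freq, w1_freq - pair_freq],
        [w2_freq - pair_freq, total - w1_freq - w2_freq + pair_freq],
    ]
    row_sums = [sum(row) for row in observed]
    col_sums = [sum(col) for col in zip(*observed)]

    g2 = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:  # empty cells contribute nothing to the sum
                expected = row_sums[i] * col_sums[j] / total
                g2 += o * math.log(o / expected)
    return 2.0 * g2


# Toy counts, invented for illustration only (not corpus figures).
candidates = {
    ("avere", "paura"): log_likelihood(800, 1000, 900, 1_000_000),
    ("essere", "presente"): log_likelihood(50, 20_000, 3_000, 1_000_000),
}
# Rank candidates from the strongest to the weakest association.
ranked = sorted(candidates, key=candidates.get, reverse=True)
```
]]></preformat>
      <p>Sorting by the score in descending order gives a ranked candidate list of the kind described above; a pair whose observed counts match the expected ones exactly scores zero.</p>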
      <p>As a first step, we selected top-ranked results by
cutting at LL 7,500, which we observed to be a
good balance between precision (high chance of
being an MWE) and recall (enough variety),
yielding 7,045 candidates. Then we manually
annotated this list of candidates to obtain the gold
standard inventory of Italian MWEs released and
described in the present paper. Each candidate was
validated independently by two annotators, and a
third annotator adjudicated the conflicting cases,<sup>3</sup> which
amounted to 673 (less than 10%). We validated
sequences that were deemed to display some type
of conventionality (fixedness, idiomaticity, high
familiarity of use). We included only MWEs in
their ‘full’ form (e.g., punto di partenza ‘starting
point’, in breve tempo ‘in a short time’), thus
excluding sequences that were clearly
incomplete parts of larger MWEs (e.g. scanso di equivoci, lit.
avoidance of misunderstandings, as part of the larger
adverbial MWE a scanso di equivoci, lit. at
avoidance of misunderstandings, ‘to avoid
misunderstandings’).</p>
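      <p>The selection and validation steps can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure actually scripted by the authors: the function names, the inclusive reading of the cutoff, and the toy scores and judgements are all hypothetical. Only the threshold (7,500) and the counts in the last lines (7,045 candidates, 673 conflicts, 1,682 valid MWEs) come from the text.</p>
      <preformat><![CDATA[
```python
LL_CUTOFF = 7500  # threshold reported in the text


def select_candidates(scores, cutoff=LL_CUTOFF):
    """Keep candidates whose LogLikelihood reaches the cutoff.

    Treating the cutoff as inclusive is an assumption.
    """
    return [mwe for mwe, ll in scores.items() if ll >= cutoff]


def adjudicate(first, second, third):
    """Combine two independent judgements; a third settles conflicts."""
    final, conflicts = {}, []
    for mwe, verdict in first.items():
        if verdict == second[mwe]:
            final[mwe] = verdict
        else:
            conflicts.append(mwe)
            final[mwe] = third[mwe]
    return final, conflicts


# Toy scores and judgements, invented for illustration.
scores = {
    "punto di partenza": 9100.0,
    "scanso di equivoci": 8200.0,
    "in breve tempo": 7600.0,
    "essere di": 300.0,
}
selected = select_candidates(scores)

first = {"punto di partenza": True, "scanso di equivoci": False,
         "in breve tempo": True}
second = {"punto di partenza": True, "scanso di equivoci": False,
          "in breve tempo": False}
third = {"in breve tempo": True}  # consulted only for conflicts
final, conflicts = adjudicate(first, second, third)

# Proportions derived from the figures reported in the text.
conflict_rate = 673 / 7045  # candidates that needed a third judgement
valid_rate = 1682 / 7045    # candidates validated as MWEs
```
]]></preformat>
      <p>With the reported figures, conflict_rate stays below 10% and valid_rate is close to 24%, in line with the percentages given in the paper.</p>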
    </sec>
    <sec id="sec-3">
      <title>The Resource</title>
      <p>The final list of valid MWEs amounts to 1,682
(about 24% of the candidates), and is made
available to the community.<sup>4</sup> The resource contains the
following information: (i) lemmatized MWE;<sup>5</sup> (ii)
corresponding POS-pattern;<sup>6</sup> (iii) corpus/corpora
where the MWE was found; (iv) LogLikelihood;
(v) raw frequency.</p>
      <sec id="sec-3-1">
        <title>Caveat</title>
        <p><sup>3</sup> All annotations were performed by the authors.</p>
        <p><sup>4</sup> DOI: 10.6092/unibo/amsacta/6506.
http://amsacta.unibo.it/id/eprint/6506</p>
        <p><sup>5</sup> MWEs are lemmatized because the extraction was
performed using lemmas. A consequence of this is that we may
have two identical lemmatized sequences that nevertheless differ
in POS-tagging. For instance, cambio di guardia (lit. change
of guard) occurs twice: in one case di ‘of’ is tagged as a bare
preposition, in the other as an articulated preposition (della
‘of the’), giving rise to two partially different MWEs (the
latter may mean both ‘changing of the guard’ and ‘changeover
of leaders’, whereas the former can refer only to the second
of these meanings).</p>
        <p><sup>6</sup> The tagset is available here: http://medialab.di.unipi.it/wiki/Tanl_POS_Tagset</p>
        <p>In order to make our resource re-usable on the
very same corpora employed for the extraction,
we kept all data in their original form. This
means that lemmatization and POS-tagging were
retained, even if erroneous.</p>
        <p>Examples of errors and anomalies include:
(a) inconsistent lemmatization, especially for
prepositions (e.g. radere al suolo ‘raze to the
ground’ occurs twice, lemmatized as radere a
suolo and radere al suolo, although the preposition
is correctly tagged as an articulated preposition in
both cases) and conjunctions (e.g. carne e ossa
‘flesh and blood’ and the almost identical carne
ed ossa, with the euphonic -d on the conjunction e
‘and’, are two separate items);</p>
        <p>(b) wrong lemmatization and tagging,
especially for participial-like forms (e.g. centro abitato
‘residential area’, lit. center inhabited, lemmatized
as centro abitare, lit. center to inhabit; or posta
elettronica ‘electronic mail’ lemmatized as porre
elettronico, lit. to put electronic, since posta is
interpreted as the feminine past participle of porre
‘to put’ and not as the noun posta ‘mail’), but
also in other cases (e.g. lavori di costruzione ‘construction
works’ lemmatized as lavorio [instead of lavoro]
di costruzione; or meccanica quantistica
‘quantum mechanics’ where meccanica is tagged as an
adjective);</p>
        <p>(c) multiple tagging for the same form (essere
vero ‘be true’ occurs twice because vero is tagged
sometimes as an adjective, sometimes as an
adverb).</p>
        <p>Tricky cases also include lexicalized forms
(guarda caso ‘strangely enough’, where guarda is
– correctly, from the technical point of view –
lemmatized as guardare ‘look’ and tagged as a verb,
although it no longer functions as a verb within that lexicalized
expression) and pronominal verbs (like sentirsi in
dovere ‘to feel obliged’, where the verb is
lemmatized as sentire ‘to feel’, and not as its reflexive
form sentirsi, although the MWE requires the
reflexive form).</p>
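        <p>Users who want to spot entries affected by multiple tagging, as in case (c), could group the resource by lemmatized sequence. The sketch below assumes a hypothetical in-memory representation with the five fields listed in Section 3; the Entry type, the field names, the pattern labels and the LogLikelihood and frequency values are invented for illustration and do not reflect the actual file layout of the release.</p>
        <preformat><![CDATA[
```python
from collections import defaultdict
from typing import NamedTuple


class Entry(NamedTuple):
    """One row of the resource (hypothetical field names)."""
    lemmas: str            # lemmatized MWE
    pos_pattern: str       # corresponding POS-pattern
    corpora: str           # corpus/corpora where the MWE was found
    log_likelihood: float
    frequency: int


def multiply_tagged(entries):
    """Return lemmatized MWEs listed under more than one POS-pattern."""
    patterns = defaultdict(set)
    for entry in entries:
        patterns[entry.lemmas].add(entry.pos_pattern)
    return {lemmas: sorted(found)
            for lemmas, found in patterns.items() if len(found) > 1}


# Toy rows: the scores and frequencies are invented.
entries = [
    Entry("essere vero", "V-Adj", "both", 8000.0, 120),
    Entry("essere vero", "V-Adv", "both", 7900.0, 95),
    Entry("punto di vista", "N-Prep-N", "both", 9500.0, 400),
]
duplicates = multiply_tagged(entries)
```
]]></preformat>
        <p>Such a check only flags identical lemma strings; near-duplicates like radere a suolo and radere al suolo would need a looser comparison.</p>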
      </sec>
      <sec id="sec-3-2">
        <title>POS-patterns</title>
        <p>The validated MWEs in this first release
instantiate 82 POS patterns out of the 122 used for the
extraction (cf. Section 2). Non-represented
patterns (over 30% of the original set) include e.g.
Prep-Adj-Verb (e.g. per quieto vivere ‘for a quiet
life’) as well as more complex – and arguably less
frequent – patterns such as N-Prep-ArtDef-N-Adj
(e.g. lotta contro la criminalità organizzata ‘fight
against organized crime’).</p>
        <table-wrap>
          <table>
            <thead>
              <tr><th>Pattern</th><th>Fq.</th><th>Example</th></tr>
            </thead>
            <tbody>
              <tr><td>N-Prep-N</td><td>165</td><td>punto di vista ‘viewpoint’</td></tr>
              <tr><td>V-ArtDef-N</td><td>152</td><td>valere la pena ‘to be worth’</td></tr>
              <tr><td>V-Prep-N</td><td>110</td><td>scendere in campo ‘to take the field’</td></tr>
              <tr><td>V-N</td><td>83</td><td>avere paura ‘to be afraid’</td></tr>
              <tr><td>V-ArtIndef-N</td><td>83</td><td>correre un rischio ‘to run a risk’</td></tr>
              <tr><td>N-A</td><td>80</td><td>tavola rotonda ‘round table’</td></tr>
              <tr><td>N-PrepArt-N</td><td>79</td><td>vigile del fuoco ‘fireman’</td></tr>
              <tr><td>Prep-N-Prep</td><td>77</td><td>di fronte a ‘in front of’</td></tr>
              <tr><td>PrepArt-N-Prep</td><td>75</td><td>al fine di ‘with the aim of’</td></tr>
              <tr><td>Prep-N</td><td>63</td><td>di parte ‘biased’</td></tr>
              <tr><td>V-Adv</td><td>62</td><td>andare avanti ‘to go on’</td></tr>
              <tr><td>N-N</td><td>62</td><td>piano terra ‘ground floor’</td></tr>
              <tr><td>V-Adj</td><td>55</td><td>essere presente ‘to be there’</td></tr>
              <tr><td>V-PrepArt-N</td><td>47</td><td>entrare nel merito ‘to address’</td></tr>
              <tr><td>Prep-ArtDef-N</td><td>35</td><td>dietro le quinte ‘behind the scenes’</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <p>Overall, most attested patterns are 2- or
3-grams. The first 4-slot pattern, V-Prep-ArtIndef-N,
only appears at rank 36, corresponding to 8
different MWEs (e.g. rispondere a una domanda ‘to
answer a question’).</p>
        <p>
          In terms of lexical categories, as expected, the most
frequent patterns pertain to the nominal and
verbal domains. The N-Prep(Art)-N type is the most
common pattern for complex nominals, in
agreement with theoretical literature
          <xref ref-type="bibr" rid="ref11">(e.g. Masini, 2009)</xref>
          . Patterns headed by prepositions and
giving rise to complex prepositions, conjunctions and
modifiers are also numerous.
        </p>
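        <p>The observation that most patterns are 2- or 3-grams can be checked directly on the pattern labels. The sketch below uses the frequencies of the fifteen most frequent POS-patterns reported in this section and assumes, as a convention of this example, that each hyphen-separated label corresponds to one slot (so that PrepArt counts as a single slot). The helper names are hypothetical.</p>
        <preformat><![CDATA[
```python
from collections import Counter

# Frequencies of the fifteen most frequent POS-patterns, as reported above.
TOP_PATTERNS = {
    "N-Prep-N": 165, "V-ArtDef-N": 152, "V-Prep-N": 110,
    "V-N": 83, "V-ArtIndef-N": 83, "N-A": 80,
    "N-PrepArt-N": 79, "Prep-N-Prep": 77, "PrepArt-N-Prep": 75,
    "Prep-N": 63, "V-Adv": 62, "N-N": 62,
    "V-Adj": 55, "V-PrepArt-N": 47, "Prep-ArtDef-N": 35,
}


def slots(pattern):
    """Number of slots in a POS-pattern such as 'N-Prep-N'."""
    return len(pattern.split("-"))


def mwes_by_length(pattern_freqs):
    """Total number of MWEs for each pattern length."""
    totals = Counter()
    for pattern, freq in pattern_freqs.items():
        totals[slots(pattern)] += freq
    return dict(totals)


by_length = mwes_by_length(TOP_PATTERNS)
covered = sum(TOP_PATTERNS.values())  # MWEs covered by these patterns
```
]]></preformat>
        <p>Applied to the full list of 82 attested patterns, the same grouping would give the complete distribution by length; here it only covers the fifteen patterns listed.</p>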
      </sec>
      <sec id="sec-3-3">
        <table-wrap>
          <table>
            <thead>
              <tr><th>Pattern</th><th>Fq.</th><th>Example</th></tr>
            </thead>
            <tbody>
              <tr><td>Prep-Adj-Conj-Adj</td><td>1</td><td>in bianco e nero ‘in black and white’</td></tr>
              <tr><td>V-ArtDef-N-A</td><td>1</td><td>dare il via libera ‘to give the green light’</td></tr>
              <tr><td>A-Prep-V</td><td>1</td><td>difficile a dirsi ‘difficult to say’</td></tr>
              <tr><td>V-Prep-Adj-N</td><td>1</td><td>mettere a dura prova ‘to put a strain (on)’</td></tr>
              <tr><td>Adj-Prep-N</td><td>1</td><td>degno di nota ‘noteworthy’</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <p>The single-word lemmas that combine to form the
MWEs in our list amount to 1,235.</p>
        <p>Not surprisingly, among the most used lemmas
we find function words like prepositions (di ‘of’
fq.421; in ‘in’ fq.227; al ‘at/to the’ fq.124, a ‘at/to’
fq.55 and ad ‘at/to’ fq.10; per ‘for’ fq.50; da
‘from’ fq.34; su ‘on’ fq.24; con ‘with’ fq.20) and
determiners (il ‘the’ fq.208; un ‘a’ fq.71 and una
‘a’ fq.41), which appear in many POS-patterns.
Conjunctions are instead less frequent (e ‘and’
fq.21 and ed ‘and’ fq.4; o ‘or’ fq.4), as are
quantifiers (e.g. ogni ‘each’ fq.11).</p>
        <p>As expected, top-ranked verbs (essere ‘to
be’ fq.67; fare ‘to do/make’ fq.46; avere ‘to have’
fq.36; mettere ‘to put’ fq.35; prendere ‘to take’
fq.27; andare ‘to go’ fq.19; dare ‘to give’ fq.17)
and top-ranked nouns (tempo ‘time’ fq.32; mano
‘hand’ fq.26; parte ‘part’ fq.23; posto ‘place’
fq.17; giorno ‘day’ fq.16) are lexemes carrying a
generic meaning, which favors their combinatory
power. Among the most used words we also find
numerals like primo ‘first’ (fq.30) or secondo
‘second’ (fq.18), and adverbs like non ‘not’ (fq.29).</p>
        <p>
          A cursory comparison between the lemmas of
the MWEs in our list and the Vocabolario di Base
          <xref ref-type="bibr" rid="ref4">(De Mauro, 1980)</xref>
          , which contains the 7,000 most
common lemmas in Italian, shows a large
convergence: well over 70% of our lemmas are included
in the Vocabolario di Base. Thus, very frequent
MWEs also feature very common lexical items.
        </p>
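        <p>Lemma counts and the overlap with a basic vocabulary list can be obtained along the following lines. This is a sketch under stated assumptions: lemmas are taken to be the whitespace-separated tokens of each lemmatized MWE, each MWE is counted once, and the tiny MWE list and vocabulary set are toy data, not the released resource or the actual Vocabolario di Base. How the frequencies reported above were computed may differ.</p>
        <preformat><![CDATA[
```python
from collections import Counter


def lemma_counts(mwes):
    """Count how many times each lemma occurs across a list of MWEs."""
    return Counter(lemma for mwe in mwes for lemma in mwe.split())


def coverage(lemmas, basic_vocabulary):
    """Share of distinct lemmas that belong to a basic vocabulary set."""
    distinct = set(lemmas)
    return len(distinct & basic_vocabulary) / len(distinct)


# Toy data, for illustration only.
mwes = ["punto di vista", "punto di partenza", "di parte", "avere paura"]
basic_vocabulary = {"punto", "di", "vista", "parte", "avere", "paura"}

counts = lemma_counts(mwes)
share = coverage(counts, basic_vocabulary)
```
]]></preformat>
        <p>Here counts.most_common() lists the lemmas from the most to the least frequent, and share is the proportion of distinct lemmas found in the reference set.</p>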
      </sec>
      <sec id="sec-3-4">
        <title>Distribution in corpora</title>
        <p>The distribution of MWEs in the two corpora used
for the extraction is shown in Table 3.</p>
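        <p>A distribution of this kind (totals per corpus, MWEs found in one corpus only, and MWEs found in both) can be derived from the corpus field of the resource with plain set operations. The sketch below is illustrative: the function name and the mapping from MWEs to corpus names are hypothetical, and the toy attestations are invented rather than taken from the data.</p>
        <preformat><![CDATA[
```python
def corpus_distribution(found_in):
    """Count MWEs per corpus from a mapping MWE -> set of corpus names."""
    repubblica = {m for m, corpora in found_in.items()
                  if "la Repubblica" in corpora}
    paisa = {m for m, corpora in found_in.items() if "PAISÀ" in corpora}
    return {
        "la Repubblica (total)": len(repubblica),
        "PAISÀ (total)": len(paisa),
        "la Repubblica (only)": len(repubblica - paisa),
        "PAISÀ (only)": len(paisa - repubblica),
        "Both": len(repubblica & paisa),
    }


# Toy mapping, invented for illustration (not the actual attestations).
found_in = {
    "in grado di": {"la Repubblica", "PAISÀ"},
    "per la prima volta": {"la Repubblica", "PAISÀ"},
    "scendere in campo": {"la Repubblica"},
    "voce correlata": {"PAISÀ"},
}
table = corpus_distribution(found_in)
```
]]></preformat>
        <p>The share of MWEs common to both corpora is then the "Both" count divided by the total number of MWEs in the mapping.</p>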
        <p>We retrieved more MWEs from la Repubblica
than PAISÀ, which is expected given that the
latter is smaller in size (250M tokens vs. 380M).
What is less expected is the rather low number of
MWEs shared by the two corpora, amounting to
372, hence 22%. Although la Repubblica is a
journalistic source and PAISÀ is a web corpus
containing more varied text genres (especially from
Wikimedia Foundation projects), we expected a larger
convergence, considering that they both contain
written (mid-)formal texts and that PAISÀ also
contains texts from the news.</p>
        <p>Table 3. N. of MWEs by corpus. Rows: la Repubblica (total); PAISÀ (total); la Repubblica (only); PAISÀ (only); Both.</p>
        <p>Some POS-patterns seem to be definitely more
typical of one corpus than the other. As Table 4
illustrates, the N-Prep-N pattern, for instance, is
much more typical of la Repubblica, whereas the
N-Adj pattern is more attested in PAISÀ.</p>
        <p>Table 4. N-Prep-N and N-Adj by corpus. Rows: la Repubblica (only); PAISÀ (only); Both.</p>
        <p>Among top-ranked MWEs for both
LogLikelihood and raw frequency we find in grado di ‘able
to’ and per la prima volta ‘for the first time’, in
both corpora. The highest-ranked MWE in PAISÀ
is voce correlata ‘see also’, which is obviously due
to the texts that form this resource. Generally,
top-ranked MWEs for LogLikelihood also have high
frequency, but not in all cases: essere in essere ‘to
exist’, for instance, turns out to be highly
significant in terms of LogLikelihood but has a very low
frequency in both corpora.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Discussion</title>
      <sec id="sec-4-3">
        <p>The sequences contained in this release are
obviously quite heterogeneous.</p>
        <p>Semantically speaking, some are very idiomatic
in meaning (e.g. braccio di ferro ‘arm wrestling’,
colpo di scena ‘coup de théâtre’, mandare in onda
‘to broadcast’), others (much) less so (e.g.
prendere le distanze ‘to distance (oneself)’,
andare in pensione ‘to retire’, di servizio ‘service
(adj.)’), their special character lying more in their
familiar, conventional status (e.g. sapere benissimo ‘to
know (damn) well’, essere favorevole ‘to be in
favour’, nella storia ‘in history’). Still others may
have more than one meaning, with different
degrees of figurativity (e.g. mettere in scena, which
can mean both ‘to stage’ and ‘to enact’).</p>
        <p>
          From a formal point of view, some look rather
fixed and do not admit lexical insertion (e.g. vero
e proprio ‘proper’) or inflection (e.g. tra l’altro
‘by the way’, ordine del giorno ‘agenda’), whereas
others seem more flexible (e.g. essere certo ‘to
be sure’, andare bene ‘to be OK, to go well’,
posto di lavoro ‘workplace’). MWE variability
is one aspect that we did not address here but that
definitely deserves to be investigated more
thoroughly (cf. e.g.
          <xref ref-type="bibr" rid="ref12">Nissim and Zaninello, 2011</xref>
          ). In
fact, some MWEs may exhibit different behaviour
and even completely different meanings
according to their grammatical form, like, for example,
a suo tempo ‘in due course’ (lit. in his/her time)
vs. ai suoi tempi ‘in his/her time’ (lit. in his/her
times). Being based on lemmatized forms, our
study does not currently account for such form
differences. Moreover, our study is based on
contiguous sequences, therefore discontinuous or
topicalized occurrences are not accounted for.
        </p>
        <p>
          We also aim to broaden this initial list by
exploring more candidates from the CombiNet data,
which are still rich in relevant
material. This first release, although limited, is
meaningful since it is the first list of commonly used
MWEs available for the Italian language, except
for domain-specific resources such as PANACEA
          <xref ref-type="bibr" rid="ref5">(Frontini et al., 2012)</xref>
          . Although lexicographic
material is now accessible for Italian lexical
combinatorics (see e.g.
          <xref ref-type="bibr" rid="ref9">Lo Cascio, 2013</xref>
          ),
usage-based and freely available lists of MWEs are still
missing and much needed, both for computational
tasks and for applied (lexicographic and language
teaching related) purposes.
        </p>
        <p>
This research relies on data extracted within the
CombiNet project (PRIN 2010-2011 Word
Combinations in Italian, n. 20105B3HE8), coordinated
by Raffaele Simone and Alessandro Lenci, and
funded by the Italian Ministry of Education,
University and Research (MIUR).
        </p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name><surname>Calzolari</surname> <given-names>Nicoletta</given-names></string-name>, <string-name><surname>Fillmore</surname> <given-names>Charles J.</given-names></string-name>, <string-name><surname>Grishman</surname> <given-names>Ralph</given-names></string-name>, <string-name><surname>Ide</surname> <given-names>Nancy</given-names></string-name>, <string-name><surname>Lenci</surname> <given-names>Alessandro</given-names></string-name>, <string-name><surname>MacLeod</surname> <given-names>Catherine</given-names></string-name> and <string-name><surname>Zampolli</surname> <given-names>Antonio</given-names></string-name>.
          <year>2002</year>.
          <article-title>Towards best practice for multiword expressions in computational lexicons</article-title>.
          In Rodríguez, M. G. and Araujo, C. P. S. (eds.),
          <source>Towards Best Practice for Multiword Expressions in Computational Lexicons</source>.
          LREC, <fpage>1934</fpage>-<lpage>40</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name><surname>Baldwin</surname> <given-names>Timothy</given-names></string-name> and <string-name><surname>Kim</surname> <given-names>Su Nam</given-names></string-name>.
          <year>2010</year>.
          <article-title>Multiword expressions</article-title>.
          In Indurkhya, N. and Damerau, F. J. (eds.),
          <source>Handbook of natural language processing</source>,
          <fpage>267</fpage>-<lpage>29</lpage>.
          Taylor and Francis Group, Boca Raton (FL).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name><surname>Baroni</surname> <given-names>Marco</given-names></string-name>, <string-name><surname>Bernardini</surname> <given-names>Silvia</given-names></string-name>, <string-name><surname>Comastri</surname> <given-names>Federica</given-names></string-name>, <string-name><surname>Piccioni</surname> <given-names>Lorenzo</given-names></string-name>, <string-name><surname>Volpi</surname> <given-names>Alessandra</given-names></string-name>, <string-name><surname>Aston</surname> <given-names>Guy</given-names></string-name> and <string-name><surname>Mazzoleni</surname> <given-names>Marco</given-names></string-name>.
          <year>2004</year>.
          <article-title>Introducing the La Repubblica Corpus: A Large, Annotated, TEI (XML)-compliant Corpus of Newspaper Italian</article-title>.
          In Lino, M. T., Xavier, M. F., Ferreira, F., Costa, R. and Silva, R. (eds.),
          <source>Proceedings of the Third International Conference on Language Resources and Evaluation (LREC)</source>,
          <fpage>1771</fpage>-<lpage>4</lpage>.
          European Language Resources Association, Lisbon.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name><surname>De Mauro</surname> <given-names>Tullio</given-names></string-name>.
          <year>1980</year>.
          <source>Guida all'uso delle parole</source>.
          Editori Riuniti, Roma.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name><surname>Frontini</surname> <given-names>Francesca</given-names></string-name>, <string-name><surname>Quochi</surname> <given-names>Valeria</given-names></string-name> and <string-name><surname>Rubino</surname> <given-names>Francesco</given-names></string-name>.
          <year>2012</year>.
          <article-title>Automatic Creation of quality Multi-word Lexica from noisy text data</article-title>.
          In Kay, M. and Boitet, C. (eds.),
          <source>Proceedings of COLING 2012: Sixth Workshop on Analytics for Noisy Unstructured Text Data: 24th International Conference on Computational Linguistics COLING 2012</source>;
          2012 December 8-15. http://hdl.handle.net/10230/20422.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name><surname>Gries</surname> <given-names>Stefan T.</given-names></string-name>
          <year>2008</year>.
          <article-title>Phraseology and linguistic theory: A brief survey</article-title>.
          In Granger, S. and Meunier, F. (eds.),
          <source>Phraseology: An interdisciplinary perspective</source>,
          <fpage>3</fpage>-<lpage>25</lpage>.
          John Benjamins, Amsterdam/Philadelphia.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name><surname>Lenci</surname> <given-names>Alessandro</given-names></string-name>.
          <year>2014</year>.
          <article-title>Carving verb classes from corpora</article-title>.
          In Simone, R. and Masini, F. (eds.),
          <source>Word Classes. Nature, typology and representation</source>,
          <fpage>17</fpage>-<lpage>36</lpage>.
          John Benjamins, Amsterdam/Philadelphia.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name><surname>Lenci</surname> <given-names>Alessandro</given-names></string-name>, <string-name><surname>Masini</surname> <given-names>Francesca</given-names></string-name>, <string-name><surname>Nissim</surname> <given-names>Malvina</given-names></string-name>, <string-name><surname>Castagnoli</surname> <given-names>Sara</given-names></string-name>, <string-name><surname>Lebani</surname> <given-names>Gianluca E.</given-names></string-name>, <string-name><surname>Passaro</surname> <given-names>Lucia C.</given-names></string-name> and <string-name><surname>Senaldi</surname> <given-names>Marco S. G.</given-names></string-name>
          <year>2017</year>.
          <article-title>How to harvest Word Combinations from corpora: Methods, evaluation and perspectives</article-title>.
          <source>Studi e Saggi Linguistici</source>,
          <volume>55</volume>(<issue>2</issue>):
          <fpage>45</fpage>-<lpage>68</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name><surname>Lo Cascio</surname> <given-names>Vincenzo</given-names></string-name>.
          <year>2013</year>.
          <source>Dizionario combinatorio italiano</source>.
          John Benjamins, Amsterdam/Philadelphia.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
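          <string-name><surname>Lyding</surname> <given-names>Verena</given-names></string-name>, <string-name><surname>Stemle</surname> <given-names>Egon</given-names></string-name>, <string-name><surname>Borghetti</surname> <given-names>Claudia</given-names></string-name>, <string-name><surname>Brunello</surname> <given-names>Marco</given-names></string-name>, <string-name><surname>Castagnoli</surname> <given-names>Sara</given-names></string-name>, <string-name><surname>Dell'Orletta</surname> <given-names>Felice</given-names></string-name>, <string-name><surname>Dittmann</surname> <given-names>Henrik</given-names></string-name>, <string-name><surname>Lenci</surname> <given-names>Alessandro</given-names></string-name> and <string-name><surname>Pirrelli</surname> <given-names>Vito</given-names></string-name>.
          <year>2014</year>.
          <article-title>The PAISÀ corpus of Italian web texts</article-title>.
          <source>9th Web as Corpus Workshop (WaC-9) @ EACL 2014</source>,
          <fpage>36</fpage>-<lpage>43</lpage>.
          EACL (European chapter of the Association for Computational Linguistics).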
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <string-name><surname>Masini</surname> <given-names>Francesca</given-names></string-name>.
          <year>2009</year>.
          <article-title>Phrasal lexemes, compounds and phrases</article-title>.
          <source>Word Structure</source>,
          <volume>2</volume>(<issue>2</issue>):
          <fpage>254</fpage>-<lpage>71</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <string-name><surname>Nissim</surname> <given-names>Malvina</given-names></string-name> and <string-name><surname>Zaninello</surname> <given-names>Andrea</given-names></string-name>.
          <year>2011</year>.
          <article-title>A quantitative study on the morphology of Italian multiword expressions</article-title>.
          <source>Lingue e linguaggio</source>,
          <volume>10</volume>(<issue>2</issue>):
          <fpage>283</fpage>-<lpage>300</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <string-name><surname>Passaro</surname> <given-names>Lucia C.</given-names></string-name> and <string-name><surname>Lenci</surname> <given-names>Alessandro</given-names></string-name>.
          <year>2015</year>.
          <article-title>Extracting Terms with EXTra</article-title>.
          In Corpas Pastor, G. (ed.),
          <source>Computerised and Corpus-based Approaches to Phraseology. Monolingual and Multilingual Perspective</source>,
          <fpage>188</fpage>-<lpage>196</lpage>.
          Editions Tradulex, Geneva.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <string-name><surname>Ramisch</surname> <given-names>Carlos</given-names></string-name>.
          <year>2015</year>.
          <source>Multiword Expressions Acquisition - A Generic and Open Framework</source>.
          Springer, Dordrecht.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          <string-name><surname>Sag</surname> <given-names>Ivan A.</given-names></string-name>, <string-name><surname>Baldwin</surname> <given-names>Timothy</given-names></string-name>, <string-name><surname>Bond</surname> <given-names>Francis</given-names></string-name>, <string-name><surname>Copestake</surname> <given-names>Ann</given-names></string-name> and <string-name><surname>Flickinger</surname> <given-names>Dan</given-names></string-name>.
          <year>2002</year>.
          <article-title>Multiword expressions: A pain in the neck for NLP</article-title>.
          <source>International conference on intelligent text processing and computational linguistics</source>,
          <fpage>1</fpage>-<lpage>15</lpage>.
          Springer, Dordrecht.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          <string-name><surname>Simone</surname> <given-names>Raffaele</given-names></string-name> and <string-name><surname>Piunno</surname> <given-names>Valentina</given-names></string-name>.
          <year>2017a</year>.
          <article-title>Entry word combination: lexicographical representation and lexicological aspects</article-title>.
          <source>Studi e Saggi Linguistici</source>,
          <volume>55</volume>(<issue>2</issue>):
          <fpage>13</fpage>-<lpage>44</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          <string-name><surname>Simone</surname> <given-names>Raffaele</given-names></string-name> and <string-name><surname>Piunno</surname> <given-names>Valentina</given-names></string-name>, editors.
          <year>2017b</year>.
          <article-title>Word Combinations: phenomena, methods of extraction, tools</article-title>,
          Special Issue of <source>Studi e Saggi Linguistici</source>,
          <volume>55</volume>(<issue>2</issue>).
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          <string-name><surname>Villavicencio</surname> <given-names>Aline</given-names></string-name>, <string-name><surname>Bond</surname> <given-names>Francis</given-names></string-name>, <string-name><surname>Korhonen</surname> <given-names>Anna</given-names></string-name> and <string-name><surname>McCarthy</surname> <given-names>Diana</given-names></string-name>.
          <year>2005</year>.
          <article-title>Introduction to the special issue on multiword expressions: Having a crack at a hard nut</article-title>.
          <source>Computer Speech and Language</source>,
          <volume>19</volume>(<issue>4</issue>):
          <fpage>365</fpage>-<lpage>377</lpage>.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>