=Paper= {{Paper |id=Vol-2253/paper55 |storemode=property |title=PARSEME-IT - Issues in verbal Multiword Expressions Identification and Classification |pdfUrl=https://ceur-ws.org/Vol-2253/paper55.pdf |volume=Vol-2253 |authors=Johanna Monti,Valeria Caruso,Maria Pia Di Buono |dblpUrl=https://dblp.org/rec/conf/clic-it/MontiCB18 }} ==PARSEME-IT - Issues in verbal Multiword Expressions Identification and Classification== https://ceur-ws.org/Vol-2253/paper55.pdf
                                PARSEME-IT
     Issues in verbal Multiword Expressions identification and classification
                          Johanna Monti 1, Valeria Caruso 1, Maria Pia di Buono 2
    1 Dep. of Literary, Linguistic and Comparative Studies, “L’Orientale” University of Naples, Italy
    2 TakeLab - Faculty of Electrical Engineering and Computing - University of Zagreb, Croatia
        jmonti@unior.it, vcaruso@unior.it, mariapia.dibuono@fer.hr


Abstract

English. The second edition of the PARSEME shared task was based on new guidelines and
methodologies that particularly concerned the Italian language with the introduction of
new categories of verbs not considered in the previous edition. This contribution presents
the novelties introduced, the results obtained and the problems that emerged during the
annotation process and concerning some categories of verbs.

Italiano. La seconda edizione del PARSEME shared task si è basata su nuove linee guida e
metodologie che hanno riguardato in particolar modo la lingua italiana con l’introduzione
di nuove categorie di verbi non considerate nella precedente edizione. Il contributo
presenta le novità introdotte, i risultati ottenuti e le problematiche che sono emerse
durante l’annotazione relativamente ad alcune categorie di verbi.

1   Introduction

The paper reports on some final results of the second edition of an annotation trial for
verbal Multiword Expressions (VMWEs) carried out on the Italian language by the PARSEME-IT
research group [1], which started within the broader European PARSEME project, the IC1207
COST action that ended in April 2017 [2].

  [1] https://sites.google.com/view/parseme-it/home
  [2] https://typo.uni-konstanz.de/parseme/

The initial project is expanding in this second stage of its development, thanks to a
wider network of research groups, working together as a section of the ACL Special
Interest Group on the Lexicon, called SIGLEX-MWE.

In its first edition, the PARSEME shared task released a corpus of 5.5 million tokens and
60,000 VMWE annotations in 18 different languages, which is distributed under different
versions of the Creative Commons license. To increase the computational efficiency of
Natural Language Processing (NLP) applications, PARSEME focuses on a special class of
Multiword Expressions which have seldom been modelled because of their challenging nature,
namely verbal MWEs (Savary et al., 2017).

Many of the features of this particular type of MWE are considered to be difficult to cope
with, such as the discontinuity they present (turn it off), the syntactic variations they
license (the decision was hard to take), the semantic variability resulting both in
literal and idiomatic readings (to take the cake), or the syntactic ambiguity of many
forms (on is a preposition in to trust on somebody, but a particle in to take on the
task). Moreover, these units have language-specific features, and are generally modelled
according to descriptive categories developed by different traditions of linguistic
studies. The PARSEME research group thus also addresses the creation of a multilingual
common platform for VMWEs using universal terminology, guidelines and methodologies for
the identification of these units cross-linguistically. Moreover, at the end of the first
annotation trial a shared task on automatic identification of VMWEs was also carried out
and has proved the reliability and usefulness of the data collected so far, which have
already been presented and discussed (Savary et al., 2017; Monti et al., 2017).

The paper illustrates the types of VMWEs used by the second PARSEME annotation trial more
thoroughly. In Section 2 we provide a brief description of the second annotation trial of
the PARSEME shared task together with the statistics. Then we present a new category of
verbal MWEs,
namely Inherently Clitic Verbs (Section 3), and in Section 4 two very productive
categories in Italian (IRV and VID). In Section 5, we discuss some borderline cases which
posed some classification issues. Finally, we conclude and discuss future work.

2   PARSEME Shared Task Second annotation trial: a brief report

This section focuses on the novelties which have been introduced in the guidelines and
methodologies used for the second annotation trial in order to cover a wider range of
VMWEs which were left out in the first stage of the project. The improvements seem to be
particularly valuable for the data collection carried out on the Italian language, because
they address some peculiarities of the Italian language which were not considered in the
first edition of the shared task but have been taken into account in the second edition,
namely:

    • Inherently clitic verbs (ICV), which is an extremely rich and varied VMWE category
      in Italian (Masini, 2015). As described in Section 3, a language specific category
      was created for the Italian language (LS.ICV) which takes into account only those
      verbs whose semantics is changed by a non-reflexive clitic pronoun, like entrarci
      when it means to be relevant to something, while the intransitive form of the verb
      entrare means to enter.

    • Inherently adpositional verbs (IAV), a high frequency category of VMWEs, namely
      those verbs whose meanings are significantly affected by an “idiomatic selected
      preposition”, like su in contare su qualcuno (to rely on someone): without the
      preposition the verb means only to determine the total number of something. These
      verbs are often called prepositional verbs [3].

    • Multi verb constructions (MVC), VMWEs composed of a sequence of two adjacent verbs
      (in a language-dependent order), a governing verb V_gov (also called a vector verb)
      and a dependent verb V_dep (also called a polar verb), like in lasciar perdere (to
      give up).

  [3] Schneider, N., Green, M., 2015, New Guidelines for Annotating Prepositional Verbs,
      https://github.com/nschneid/nanni/wiki/Prepositional-Verb-Annotation-Guidelines

The other classifying categories used are (a) light verb constructions (LVCs), e.g., fare
una passeggiata (to have a walk), and (b) idioms (ID), e.g., tirare le cuoia (to kick the
bucket), considered to be universal categories, i.e. categories which can be found in all
languages participating in the task.

Other VMWEs are instead maintained as quasi-universal categories, since their range of
application seems to cover only some language groups or languages, but not all. They are
(c) inherently reflexive verbs (IReflVs), and (d) verb-particle constructions (VPCs). The
first group (IReflVs) allows annotators to account for verbs which are never used without
a reflexive clitic pronoun, e.g., (Italian) suicidarsi (to commit suicide), or for those
verbs whose meaning is significantly affected by the pronoun, e.g., (Italian) farsi (to
take drugs), while the non-pronominal form, fare, means to make. Semantic aspects are also
used to identify verb-particle constructions (VPC) because their meaning is fully
non-compositional, e.g., buttare giù (to swallow), or only partly non-compositional, like
in tirare avanti (to go on), since the preposition no longer retains its spatial meaning.

Table 1 presents the statistics of the various categories of VMWEs in the PARSEME-IT
corpus 1.1.

3   A language specific category: Inherently clitic verbs (LS.ICV)

Inherently Clitic Verbs (LS.ICV) represent a specific category for some Romance languages,
and they are particularly frequent in the Italian language. It is often challenging to
distinguish LS.ICV from Inherently Reflexive Verbs (IRV), particularly because some
clitics may be ambiguous, like se/si, which is a polyfunctional clitic pronoun and
grammatical marker (and can have a reflexive, reciprocal, impersonal, passivizing,
aspectual, and middle function). LS.ICVs together with IRVs are pronominal verbs. LS.ICVs
are formed by a full verb combined with one or more non-reflexive clitics that represent
the pronominalization of one or more complements (CLI).

The following verbs should be annotated as LS.ICV:

     • The verb without the CLI does not exist, e.g., infischiarsene (do not worry about)
       vs *infischiare;
                sent.   tokens   VMWEs      IAV    IRV    LS.ICV    LVC.cause/full   MVC    VID    VPC.full/semi
    IT-dev       917     32613      500       44    106         9          19/100       6    197            17/2
    IT-train   13555    360883     3254     414     942        20         147/544      23   1098            66/0

                                    Table 1: PARSEME-IT corpus version 1.1
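
As a quick consistency check on Table 1, the per-category counts can be summed and compared with the VMWEs column, and each category's share of the annotations can be derived. The following Python sketch does this. Only the counts come from Table 1; the dictionary layout and the function names are illustrative assumptions, not part of the PARSEME-IT release.

```python
# Counts copied from Table 1 (PARSEME-IT corpus version 1.1).
# The data layout and helper names below are illustrative assumptions.
TABLE_1 = {
    "IT-dev": {
        "sentences": 917, "tokens": 32613, "VMWEs": 500,
        "categories": {
            "IAV": 44, "IRV": 106, "LS.ICV": 9,
            "LVC.cause": 19, "LVC.full": 100, "MVC": 6,
            "VID": 197, "VPC.full": 17, "VPC.semi": 2,
        },
    },
    "IT-train": {
        "sentences": 13555, "tokens": 360883, "VMWEs": 3254,
        "categories": {
            "IAV": 414, "IRV": 942, "LS.ICV": 20,
            "LVC.cause": 147, "LVC.full": 544, "MVC": 23,
            "VID": 1098, "VPC.full": 66, "VPC.semi": 0,
        },
    },
}


def category_total(split):
    """Sum of the per-category counts for one split."""
    return sum(TABLE_1[split]["categories"].values())


def category_shares(split):
    """Share of each category among all VMWE annotations of one split."""
    row = TABLE_1[split]
    return {cat: n / row["VMWEs"] for cat, n in row["categories"].items()}


def most_frequent(split, n=2):
    """The n categories with the highest counts, most frequent first."""
    cats = TABLE_1[split]["categories"]
    return sorted(cats, key=cats.get, reverse=True)[:n]


if __name__ == "__main__":
    for split in TABLE_1:
        # The category columns should add up to the VMWEs column.
        assert category_total(split) == TABLE_1[split]["VMWEs"]
        print(split, "most frequent:", most_frequent(split))
        for cat, share in category_shares(split).items():
            print(f"  {cat:10s} {share:6.1%}")
```

In both splits the category columns add up to the VMWEs column, and VID and IRV are the two largest categories.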


     • The verb without the CLI does exist, but has a very different meaning, as in
       prenderle (gl.: to take them, transl. to be beaten) vs prendere (to take) or
       prenderci (gl.: to take it, transl. to grasp the truth) vs prendere (to take);

     • The verb has more than one CLI of which the second one is an invariable object
       complement, like in fregarsene (gl.: matter self of-it, transl. do not care about)
       or infischiarsene (do not worry about);

     • The verb has two non-reflexive invariable CLIs, like in farcela (gl.: to make there
       it, transl. to succeed);

     • The verb has a different meaning with respect to an intensive use of the same two
       non-reflexive invariable CLIs, like in andarsene (gl.: to go away self from-there,
       transl. to die) vs andarsene (to go away) or bersela (gl.: drink self it, transl.
       to believe) vs bersela (to drink it).

The annotation of LS.ICV was performed following a specific decision tree [4].

  [4] http://parsemefr.lif.univ-mrs.fr/parseme-st-guidelines/1.1/?page=060_Language-specific_tests/015_Inherently_clitic_verbs__LB_LS.ICV_RB_

In the training corpus 20 different LS.ICV were annotated manually, such as farcela,
rimetterci, fregarsene among others.

4   Very productive VMWEs: IRVs and VID

IRVs and VID represent very productive categories in Italian which pose some classifying
issues due to their specific characteristics.

With reference to IRVs, the presence of the clitic pronoun si may generate ambiguity in
the annotation process, as in Italian it refers to three different types of construction:
i) reflexive, ii) impersonal, iii) inherent.

In order to distinguish these cases, we consider that in the reflexive construction, the
clitic pronoun can be paraphrased by means of either an anaphoric expression which stands
for se stesso (oneself) or a mutual expression which refers to gli uni e gli altri (these
and those). Another relevant aspect to consider in the classification of IRVs is the
presence of an implicit thematic role due to the fact that the action includes two
different entities with different thematic properties but with the same reference, e.g.,
in guardarsi (to look at oneself) the clitic signals the presence of coreference between
the first argument and the second one. Another source of misclassification of IRVs is
related to the presence of anticausative constructions. In these constructions, the clitic
may represent an overt marker of reduced transitivity, e.g., sedersi (to sit down).

In some cases, IRVs occur in idiomatic constructions and their meaning is affected by the
presence of new elements, such as in guardarsi bene da (to be careful not to).
Consequently the annotation of such occurrences is subject to the evaluation of
characteristics related to VID, such as low variability, the presence of semantically
non-compositional meaning, and literal-idiomatic ambiguity. In the VID class, the
non-compositionality property is prototypical, such as in battersi all’ultimo sangue (lit.
to fight till the last blood), which means to fight to the last. Although their meaning is
opaque, sometimes VID may have both a literal and an idiomatic meaning and the boundaries
between them are difficult to trace. For example, avere gli occhi bendati (lit. to have
the eyes covered) has both a literal meaning and an idiomatic one, and in this latter case
it should be translated in English as to be blindfolded. According to Vietri (2014b), it
is possible to classify ordinary-verb VID, namely VID which present a semantically full
verb, on the basis of their definitional structure, identified by means of the arguments
required by the operators. In the case of VID, the operator consists of the verb and the
fixed element(s), while the argument may be the subject and/or a free complement. VIDs can
also be formed by constructions based on the use of support verbs, namely avere (to have),
e.g., avere fegato (lit. to have liver, transl. to have guts), essere
(to be), e.g., essere a cavallo (to be golden) and fare (to make), e.g., fare lo gnorri
(to play the fool). The main difference between this class of VID and the one formed by
ordinary verbs is that support verbs are semantically empty, and for this reason this
class of VID presents a high degree of lexical and syntactic variability. This type of
variability is retrievable in aspectual variants, the production of causative
constructions, and the possible deletion of the support verb, which causes complex
nominalizations (Vietri, 2014a).

5   Borderline cases: LVC and IAV compared

In this section we discuss the novelties concerning two categories used in the second
edition of the PARSEME shared identification task of verbal MWEs (edition 1.1), namely LVC
and IAV. As regards LVC, two new subcategories have been introduced in the second edition,
LVC.full and LVC.cause, to account for a more fine-grained distinction between LVCs where
the verb is semantically totally bleached (e.g., to have the right), and those where the
verb adds a causative meaning (and a new semantic role) to the noun (e.g., to grant the
right). Therefore some new tests have been added to account for these subcategories, which
heavily rely on the notion of semantic arguments.

In particular, constructions annotated as LVC.cause may involve: i) verbs that are
typically used to express the cause of predicative nouns in general (e.g., cause,
provoke), ii) verbs that are only used to express the cause of particular predicative
nouns (e.g., grant in to grant a right).

IAV consists of a verb or VMWE and an idiomatic selected preposition or postposition that
is either always required or, if absent, changes the meaning of the verb or the VMWE
significantly. IAVs are verb+adposition combinations in which: i) the dependents of the
adposition are not lexicalized, or ii) the adposition cannot be omitted without markedly
altering the meaning of the verb. During the annotation trial, the IAV category has proved
to be advantageous to cover the rich inventory of VMWEs in Italian, but some issues have
also emerged, particularly with respect to the other class of LVC verbs, which also
accounts for combinations of verbs plus prepositions. Prototypical examples of IAV
collected so far include the following:

1.a Tendere a + N (to be inclined to something), base form tendere (to stretch), e.g.,
    Maria tende alla depressione (Maria tends to be depressed);

1.b Tendere a + V (to be inclined to something), e.g., Maria tende a dimagrire (Maria
    tends to lose weight);

2. Puntare su + N (to bet), base form puntare (to stick), e.g., puntare su
   qualcuno/qualcosa.

These examples exhibit clear semantic changes from the non-adpositional base form of the
verb; moreover, the preposition cannot be omitted in questions, thus proving to be part of
the verb:
   - Maria tende sempre ad esagerare. (Maria always tends to exaggerate.)
   - A cosa tende, scusa? (Sorry, what does she tend to?)

Less prototypical IAV examples include verb instances exhibiting semantic changes pivoted
by the arguments they combine with, like andare in (both to go to and to become), or
sapere di (to smell and to know about). The type of semantic interaction at stake, called
co-composition in the Generative Lexicon [5], is realized when “the complements carry
information which acts on the governing verb, essentially taking the verb as argument and
shifting its event type” (Pustejovsky, 1995). For example, andare in denotes directed
motion when combined with proper or common place nouns, like in andare in
città/montagna/America (to go to the city/mountain/America); or the medium of motion, when
combined with vehicle names, like in vado in bici/Ferrari (I ride my bike/drive my
Ferrari). However, with nouns denoting states, like andare in estasi (to become absorbed)
or andare in panico (to start to panic), the verb acquires the aspectual meaning of to go
into the state X, and cannot be classified as an LVC. With nouns referring to events,
instead, like andare in soccorso (lit. to go in assistance), the original spatial
semantics bleaches by interacting with the meaning of the noun: to go into the event X
actually denotes the action expressed by the predicative noun and can be classified as an
LVC.

  [5] Co-composition has been called ‘accommodation’ in more recent works (Pustejovsky,
      2013).
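
The readings of andare in discussed in Section 5 can be summarised as a small lookup from the semantic class of the noun to the resulting reading. The Python sketch below is only an illustration of that summary: the class labels, the `Reading` type and the function name are assumptions introduced here, and the noun list contains only the examples quoted in the text. Where the text does not state whether a reading is an LVC, the flag is left as `None`.

```python
from typing import NamedTuple, Optional


class Reading(NamedTuple):
    meaning: str
    lvc: Optional[bool]  # True/False as stated in the text; None if not stated


# Example nouns quoted in the discussion of "andare in".
# The class labels ("place", "vehicle", ...) are illustrative assumptions.
NOUN_CLASS = {
    "città": "place", "montagna": "place", "America": "place",
    "bici": "vehicle", "Ferrari": "vehicle",
    "estasi": "state", "panico": "state",
    "soccorso": "event",
}

# Reading of "andare in + N" for each noun class, following the prose.
READING_BY_CLASS = {
    "place": Reading("directed motion", None),
    "vehicle": Reading("medium of motion", None),
    "state": Reading("aspectual: to go into the state X", False),
    "event": Reading("action expressed by the predicative noun", True),
}


def reading_of_andare_in(noun: str) -> Optional[Reading]:
    """Return the reading of 'andare in <noun>', or None for unlisted nouns."""
    noun_class = NOUN_CLASS.get(noun)
    if noun_class is None:
        return None
    return READING_BY_CLASS[noun_class]


if __name__ == "__main__":
    for noun in ("città", "bici", "panico", "soccorso"):
        print(f"andare in {noun}: {reading_of_andare_in(noun)}")
```

A table of this kind only restates the examples above; real annotation depends on the guideline tests and on context, which a fixed noun list cannot capture.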
6   Conclusions and Future Work

In this paper we described the novelties concerning the PARSEME shared task on automatic
identification of verbal MWEs - edition 1.1 (2018), in which new verb categories have been
included in comparison with the 2017 edition. Some of them are language-specific, such as
ICV for some Romance languages, others are not, like IAV. The increased number of
categories makes it possible to annotate corpus data more thoroughly, and to discover a
broad range of combinatorial phenomena that present different degrees of opacity.

We also discussed two productive categories in Italian, namely IRV and VID, and analyzed
LVC and IAV borderline cases together with observations on combinatorial phenomena that
can be applied in order to annotate VMWEs more effectively.

Future work includes a further linguistic analysis of borderline cases in order to
contribute to the description of these phenomena.

Acknowledgments
This research has been partly supported by the European Regional Development Fund under
the grant KK.01.1.1.01.0009 (DATACROSS).
Authorship contribution is as follows: Johanna Monti is author of Sections 1, 2, and 3;
Valeria Caruso of Section 5, and Maria Pia di Buono of Sections 4 and 6.


References
Francesca Masini. 2015. Idiomatic verb-clitic constructions: lexicalization and
  productivity. In Mediterranean Morphology Meetings, volume 9, pages 88–104.

Johanna Monti, Maria Pia di Buono, and Federico Sangati. 2017. PARSEME-IT corpus: an
  annotated corpus of verbal multiword expressions in Italian. In Fourth Italian
  Conference on Computational Linguistics - CLiC-it 2017, pages 228–233. Accademia
  University Press.

James Pustejovsky. 1995. The generative lexicon. MIT Press.

James Pustejovsky. 2013. Type theory and lexical decomposition. In Advances in generative
  lexicon theory, pages 9–38. Springer.

Agata Savary, Carlos Ramisch, Silvio Cordeiro, Federico Sangati, Veronika Vincze, Behrang
  Qasemizadeh, Marie Candito, Fabienne Cap, Voula Giouli, Ivelina Stoyanova, et al. 2017.
  The PARSEME shared task on automatic identification of verbal multiword expressions. In
  Proceedings of the 13th Workshop on Multiword Expressions (MWE 2017), pages 31–47.

Simonetta Vietri. 2014a. Idiomatic Constructions in Italian: A Lexicon-grammar Approach,
  volume 31. John Benjamins Publishing Company.

Simonetta Vietri. 2014b. The lexicon-grammar of Italian idioms. In Workshop on Lexical and
  Grammatical Resources for Language Processing, COLING 2014, pages 137–146.