=Paper= {{Paper |id=Vol-2006/paper023 |storemode=property |title=PARSEME-It Corpus An annotated Corpus of Verbal Multiword Expressions in Italian |pdfUrl=https://ceur-ws.org/Vol-2006/paper023.pdf |volume=Vol-2006 |authors=Johanna Monti,Maria Pia di Buono,Federico Sangati |dblpUrl=https://dblp.org/rec/conf/clic-it/MontiBS17 }} ==PARSEME-It Corpus An annotated Corpus of Verbal Multiword Expressions in Italian== https://ceur-ws.org/Vol-2006/paper023.pdf
PARSEME-It Corpus
An annotated Corpus of Verbal Multiword Expressions in Italian

Johanna Monti¹, Maria Pia di Buono², Federico Sangati³
jmonti@unior.it, mariapia.dibuono@fer.hr, federico.sangati@gmail.com

¹ Dep. of Literary, Linguistic and Comparative Studies, “L’Orientale” University of Naples, Italy
² TakeLab - University of Zagreb, Croatia
³ Independent Researcher, Italy

Abstract

English. This paper describes a new language resource annotated with verbal multiword expressions (VMWEs) in Italian. It discusses the state of the art in VMWE identification and annotation in Italian, the methodology adopted, the VMWE categories annotated, the corpus and the annotation process. Finally, it presents results, conclusions and future work.

Italiano. Questo contributo descrive una nuova risorsa linguistica annotata con polirematiche verbali per la lingua italiana. Viene presentato lo stato dell’arte relativamente all’identificazione ed all’annotazione di polirematiche per la lingua italiana, la metodologia adottata, le diverse categorie di polirematiche verbali annotate nel corpus, il corpus stesso e il processo di annotazione. Infine vengono illustrati i risultati ottenuti, le conclusioni e le prospettive future.

1   Introduction

This paper outlines the development of a new language resource for Italian, the PARSEME-It VMWE corpus, annotated with Italian MWEs of a particular class: verbal multiword expressions (VMWEs). The corpus has been developed by the PARSEME-IT research group¹ in the framework of the PARSEME Shared Task on Automatic Identification of Verbal Multiword Expressions (Savary et al., 2017), a joint effort carried out within a European research network to elaborate universal terminologies and annotation guidelines for verbal multiword expressions in 18 languages, including Italian. Multiword expressions are difficult for Natural Language Processing (NLP) tools, such as parsers and machine translation engines, to identify, model and treat, mainly because of their non-compositionality. Verbal multiword expressions are particularly challenging because they occur in different syntactic structures (prendere una decisione ‘make a decision’, decisioni prese precedentemente ‘decisions made previously’), may be continuous or discontinuous (andare e venire versus andare in malora in Luigi ha fatto andare la società in malora), and may have both a literal and a figurative meaning (abboccare all’amo ‘bite the hook’ or ‘be deceived’). In this paper, we describe the state of the art in VMWE annotation and identification for the Italian language (section 2). We then present the methodology (section 3), the Italian VMWE categories taken into account for the annotation task (section 4), the corpus and the annotation process (section 5), and the results (section 6). Finally, we discuss conclusions and future work (section 7).

¹ https://www.researchgate.net/project/PARSEME-IT-Syntactic-Parsing-and-Multiword-Expressions-in-Italian

2   State of the art in VMWE identification and annotation in Italian

Several scholars have investigated different kinds of Italian VMWEs, focusing on both syntactic and semantic aspects. Among these works, we may distinguish contrastive and comparative analyses, and synchronic and diachronic studies.
In the first group, most scholars propose a comparison with Germanic languages (Mateu and Rigau, 2010), mainly to describe verb-particle constructions, which are a very common phenomenon in this family.
On the other hand, synchronic and diachronic
studies include analyses of: (i) verb-particle constructions (Masini, 2005; Iacobini and Masini, 2005; Quaglia and Trotzke, 2017), (ii) idiomatic constructions (Tabossi et al., 2011; Vietri, 2014c) with either ordinary or support verbs (Vietri, 2014b), (iii) support, or light, verbs, which are a wider phenomenon and have therefore been extensively analysed (La Fauci, 1980; D’Agostino and Elia, 1998; Cicalese, 1999; Alba-Salas, 2004; Quochi, 2007; Cicalese et al., 2016).
Reflexive verbs in Italian have been investigated as occurrences of non-local anaphora (Reuland, 1990) and with regard to their syntactic classification (Carstea Romascanu, 1977).
To the best of our knowledge, only a limited number of monolingual language resources with multiword expressions have been developed for Italian, such as a dictionary of Italian idioms (Vietri, 2014a), a series of example corpora and a database of MWEs organised around morphosyntactic patterns (Zaninello and Nissim, 2010), and a corpus annotated with Italian MWEs of a particular class: verb-noun expressions such as fare riferimento, dare luogo and prendere atto (Taslimipoor et al., 2016). At the time of writing, therefore, the PARSEME-It VMWE corpus is the first corpus sample specifically developed for NLP applications that includes several types of VMWEs.

3   Methodology

The development of the Italian VMWE corpus is based on the PARSEME annotation guidelines², provided for the shared task. The guidelines have been developed with the aim of delivering general definitions and prescriptions for the annotation of VMWEs in 18 languages while, at the same time, allowing language-specific descriptions of these linguistic phenomena (Savary et al., 2017). The annotation guidelines include three main categories:

    1. a universal category, which is common to all the languages involved in the task and holds light-verb constructions (LVCs) and idioms (ID);

    2. a quasi-universal category, relevant for some languages or language families, that contains inherently reflexive verbs (IReflVs) and verb-particle constructions (VPCs);

    3. an other VMWEs category, which is a residual category for the occurrences not belonging to any of the previous groups.

² The guidelines are available at http://parsemefr.lif.univ-mrs.fr/guidelines-hypertext/.

In order to ease the identification and categorisation of VMWEs, a decision tree method was devised with generic and language-specific tests. Generic tests consider general criteria that are valid for all languages, while language-specific tests consider structural, lexical, morphological and syntactic features that are specific to the individual languages. The decision tree includes three steps: (i) identification of a VMWE candidate, i.e., a combination of a verb with at least one other word, which is a potential VMWE; (ii) identification of the lexicalized elements of the expression; (iii) assignment of the VMWE to one of the VMWE categories, using general and language-specific tests.

4   Italian VMWEs

For the Italian VMWE annotation task, according to the PARSEME guidelines, multiword expressions are understood as (continuous or discontinuous) sequences of words with the following compulsory properties:

    • Their component words include a head word and at least one other syntactically related word. Most often the relation they maintain is a syntactic (direct or indirect) dependency, but it can also be, e.g., a coordination.

    • They show some degree of orthographic, morphological, syntactic or semantic idiosyncrasy with respect to what is considered the general grammar rules of a language.

    • At least two components of such a word sequence have to be lexicalized.

In this task we only annotate the lexicalized components and ignore open slots. Collocations, i.e., word co-occurrences whose idiosyncrasy is of statistical nature only (e.g., the graphic shows, drastically drop, etc.), are excluded from the scope of this study. The VMWEs annotated for the Italian language are:

    1. Light verb constructions (LVC), which typically consist of a verb and a noun or prepositional phrase, e.g., fare una domanda (‘to
       make a question’), fare una passeggiata (‘to have a walk’). The verb has a purely syntactic operator function (performing an activity or being in a state), whereas the noun is predicative, often referring to an event (e.g., decision, visit) or a state (e.g., fear, courage);

    2. Idioms (ID), which have at least two lexicalized components including a head verb and at least one of its arguments, e.g., tirare le cuoia (‘kick the bucket’), piovere a catinelle (‘rain cats and dogs’);

    3. Inherently reflexive verbs (IReflV), which are those reflexive verbal constructions which (a) never occur without the clitic, e.g., suicidarsi (‘suicide’), or (b) whose reflexive and non-reflexive versions have clearly different senses or subcategorization frames, e.g., riferirsi (‘refer’);

    4. Verb particle combinations (VPC), which are formed by a lexicalized head verb and a lexicalized particle dependent on the verb. The meaning of the VPC is non-compositional. Notably, the change in the meaning of the verb goes significantly beyond adding the meaning of the particle, e.g., buttare giù (‘swallow’). This type of construction is very frequent in English, German, Swedish and Hungarian, but it can also be found in Italian;

    5. Other Verbal MWEs (OTH), which gather the types not belonging to any of the categories above, e.g., corto-circuitare (‘short-circuit’).

5   Corpus and annotation task

5.1   PARSEME Italian VMWE corpus

The PARSEME-It VMWE corpus is based on a selection of texts taken from the PAISÀ corpus of Italian web texts (Lyding et al., 2014). We chose this corpus because it contains documents (i) from different web sources, e.g., Wikibooks, Wikinews, Wikiversity, and several blog services from different websites, collected in 2010 by means of Creative Commons-focused web crawling and a targeted collection of documents from specific websites; (ii) dedicated to no specific technical domain and free from copyright issues, so as to be compatible with an open license; (iii) annotated in CoNLL format, i.e., lemmatized, POS-tagged and annotated with syntactic dependencies. For our annotation task, we selected a sub-corpus of 17,000 sentences (corresponding to 421,848 tokens) randomly taken from blogs, Wikipedia and Wikinews. The corpus was kept in its original state and therefore no errors or inconsistencies were corrected. The pre-annotation of the PAISÀ corpus was kept in order to ease the identification of verbal MWEs, but we asked annotators not to overestimate the system’s performance and to review the whole text, not only the pre-annotated candidates proposed by the system. A dedicated tag in FLAT was defined for this purpose. The objective was to have a final corpus of at least 3,500 annotated VMWEs per language. Since the density of VMWEs depends heavily on the particular language, as well as on text choice and genre, we were not able to make any reliable estimate, at the beginning of the task, of the corpus size needed to reach this goal.

5.2   Annotation environment

The annotation environment used for the PARSEME-It VMWE corpus is FLAT, a web-based linguistic annotation environment³ based around the FoLiA format⁴, a rich XML-based format for linguistic annotation. FLAT allows users to view annotated FoLiA documents and enrich these documents with new annotations (Figure 1); a wide variety of linguistic annotation types is supported through the FoLiA paradigm. It is a document-centric tool that fully preserves and visualises document structure. It is open source software developed at the Centre of Language and Speech Technology, Radboud University Nijmegen, and is licensed under the GNU General Public License v3.

³ http://flat.readthedocs.io/en/latest/
⁴ http://proycon.github.io/folia
⁵ http://parsemefr.lif.univ-mrs.fr/parseme-st-guidelines/1.0/?page=home

5.3   Annotation task

The annotation task for the Italian language was performed in five different stages.

    1. The PARSEME Annotation guidelines were agreed on⁵ and examples for the Italian language were added in order to ease the annotation task for the Italian annotators. To this end, a two-phase pilot annotation in Italian
       was carried out. This step was useful for identifying the Italian VMWE categories to be annotated, and also for promoting cross-language convergence with the other languages covered by the shared task. Each pilot annotation phase provided feedback from annotators and was followed by enhancements of the guidelines, corpus format and processing tools.

Figure 1: Example of annotated data in FLAT

    2. A pre-processing step of the PAISÀ corpus was needed: a ‘no space’ column was added to the files so that the ‘nsp’ tag could mark a token that should be appended to the previous one without a space.

Figure 2: Example of the use of an nsp tag

    3. The annotation of the training set (approx. 16,000 sentences) was performed manually on running text in the FLAT environment by five Italian native speakers with a linguistic background. Each annotator was given a certain number of files, each containing 1,000 sentences in CoNLL format. All doubts about the annotation were collected in a shared file and discussed during the annotation phase. Difficulties in annotating VMWEs mainly concerned (i) the boundaries of the VMWE, as in Sei ovviamente nel pieno diritto di esprimere [...], where it is difficult to decide if the VMWE should be sei ... nel ... diritto or sei ... nel pieno diritto, (ii) the category attribution, for instance for the fare + N VMWE type, since in some cases the category is LVC, as in fare rumore, and in others it is ID, as in fare schifo, (iii) the identification of nested VMWEs, as in Mi guardo bene, where the annotator has to decide whether the ID guardarsi bene also contains an IReflV guardarsi.

    4. A few files were double-annotated to evaluate the inter-annotator agreement (IAA). Measuring IAA is not a trivial task because of the challenges posed by VMWEs and described in the Introduction. The available IAA results, namely the per-VMWE F-score (F_unit), the estimated Cohen’s κ (κ_unit) and the standard κ (κ_cat) (Savary et al., 2017), are presented in Table 1.

    5. A further 1,000 sentences were used as the test set during the shared task: the systems taking part in the shared task annotated VMWEs in them automatically, according to the same guidelines.

      #S     #T      #A1   #A2   F_unit   κ_unit   κ_cat
IT    2000   52639   336   316   0.417    0.331    0.78

Table 1: IAA scores for Italian annotation: #S and #T show the number of sentences and tokens in the corpora used for measuring the IAA, respectively. #A1 and #A2 refer to the number of VMWE instances annotated by each of the annotators (Savary et al., 2017).

6   Results

The PARSEME-It VMWE corpus is composed of 2,454 entries (Table 2) and is freely available⁶ under Creative Commons licenses.
The data have been annotated using the official parseme-tsv format⁷ (Figure 3), adapted from the CoNLL format.

⁶ http://hdl.handle.net/11372/LRT-2282
⁷ http://typo.uni-konstanz.de/parseme/index.php/2-general/184-parseme-shared-task-format-of-the-final-annotation
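As an illustration of how data in this format might be consumed, the following minimal Python sketch reads the four tab-separated columns specified in Savary et al. (2017): token position, surface form, no-space flag and VMWE code. It is not part of the official shared-task tooling. The function name parse_parseme_tsv, the use of "_" for an empty field, the skipping of lines starting with "#", and blank lines as sentence separators are all assumptions made for this example. The sample sentence is the one quoted in the Introduction, and the annotation attached to it is illustrative rather than copied from the corpus.

```python
def parse_parseme_tsv(text):
    """Parse parseme-tsv text into a list of sentences (illustrative sketch).

    Each sentence is a dict with:
      "tokens": list of surface forms, in order
      "vmwes":  dict mapping a VMWE number to
                {"category": str or None, "positions": [str, ...]}
    Assumptions: "_" marks an empty field, blank lines separate
    sentences, and lines starting with "#" are comments.
    """
    sentences = []
    tokens, vmwes = [], {}
    # The extra "" at the end flushes the last sentence.
    for line in text.splitlines() + [""]:
        if not line.strip():
            if tokens:
                sentences.append({"tokens": tokens, "vmwes": vmwes})
            tokens, vmwes = [], {}
            continue
        if line.startswith("#"):
            continue
        cols = line.split("\t")
        cols += ["_"] * (4 - len(cols))      # pad missing optional columns
        position, form, _nsp, code = cols[:4]
        tokens.append(form)
        if code in ("_", ""):
            continue
        # Several codes on one token (nested or overlapping VMWEs)
        # are separated by semicolons, e.g. "1:ID;2:IReflV".
        for part in code.split(";"):
            number, _, category = part.partition(":")
            entry = vmwes.setdefault(
                int(number), {"category": None, "positions": []}
            )
            if category:                      # only the first token carries it
                entry["category"] = category
            entry["positions"].append(position)
    return sentences


# Hypothetical annotation of the example sentence from the Introduction.
SAMPLE = (
    "1\tLuigi\t_\t_\n"
    "2\tha\t_\t_\n"
    "3\tfatto\t_\t_\n"
    "4\tandare\t_\t1:ID\n"
    "5\tla\t_\t_\n"
    "6\tsocietà\t_\t_\n"
    "7\tin\t_\t1\n"
    "8\tmalora\t_\t1\n"
)

parsed = parse_parseme_tsv(SAMPLE)
for number, vmwe in parsed[0]["vmwes"].items():
    print(number, vmwe["category"], vmwe["positions"])
```

Positions are kept as strings so that ranges such as 1-2, used for multiword tokens, survive unchanged; a real reader would probably also want to validate the codes against the category inventory.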
Category   Occurrences
ID                1163
IReflV             730
LVC                482
VPC                 73
OTH                  6
Total             2454

Table 2: Overview of VMWEs in the PARSEME-It VMWE corpus, including train and test sets.

Figure 3: Example of annotated data in parseme-tsv format

In the official parseme-tsv format, as described in Savary et al. (2017), the information about each token is represented by 4 tab-separated columns featuring (i) the position of the token in the sentence or a range of positions (e.g., 1-2) in case of multiword tokens such as contractions, (ii) the token surface form, (iii) an optional flag indicating that the current token is adjacent to the next one, and (iv) an optional VMWE code composed of the VMWE’s consecutive number in the sentence and, for the initial token in a VMWE, its category (e.g., 2:ID if a token starts an idiom which is the second VMWE in the current sentence). In case of nested, coordinated or overlapping VMWEs, multiple codes are separated with a semicolon. Furthermore, in order to provide data usable as features in the shared task systems, companion files in a format close to CoNLL-U⁸ have also been released. These companion files contain extra linguistic information, i.e., lemmas, POS-tags, morphological features, and syntactic dependencies.

⁸ http://universaldependencies.org/format.htm

7   Conclusion and Future Work

In this paper, we described a linguistic resource of Italian VMWEs, developed within the PARSEME Shared Task on Automatic Identification of VMWEs. We consider this work an initial contribution towards elaborating an Italian universal terminology of VMWEs. Future work includes the extension of the current corpus and a fine-grained linguistic analysis of the annotation in order to contribute to the description of these phenomena.

Acknowledgments

The work described in this paper has been supported by the IC1207 PARSEME COST action. The annotation work was also made possible by the help of Maarten van Gompel, who adapted the FLAT annotation platform to the needs of the community.
Our thanks also go to the Italian annotators, Valeria Caruso, Manuela Cherchi, Anna De Santis and Annalisa Raffone, for their contributions.
Authorship contribution is as follows: Johanna Monti is author of Sections 1, 3, 4 and 5.3; Maria Pia di Buono of Sections 2, 6 and 7; and Federico Sangati of Sections 5.1 and 5.2.

References

Josep Alba-Salas. 2004. Fare light verb constructions and Italian causatives: Understanding the differences. Italian Journal of Linguistics, 16(2):283.
M Carstea Romascanu. 1977. I tipi di verbi riflessivi in italiano. Revue Roumaine de Linguistique Bucuresti, 22(2):125–130.
Anna Cicalese, Emilio D’Agostino, Alberto Maria Langella, and Ilaria Villari. 2016. Els verbs locatius com a variants de verbs de suport. Quaderns d’Italià, 21:153–166.
Anna Cicalese. 1999. Le estensioni di verbo supporto. Uno studio introduttivo. Studi italiani di linguistica teorica ed applicata, 28(3):447–485.
Emilio D’Agostino and Annibale Elia. 1998. Il significato delle frasi: un continuum dalle frasi semplici alle forme polirematiche. In AA. VV., Ai limiti del linguaggio. Bari: Laterza, pages 287–310.
Claudio Iacobini and Francesca Masini. 2005. Verb-particle constructions and prefixed verbs in Italian: typology, diachrony and semantics. In Mediterranean Morphology Meetings, volume 5, pages 157–184.
Nunzio La Fauci. 1980. Aspects du mouvement de wh, verbes supports, double analyse, complétives au subjonctif en italien: pour une description compacte. Lingvisticae Investigationes, 4(2):293–341.
Verena Lyding, Egon Stemle, Claudia Borghetti, Marco Brunello, Sara Castagnoli, Felice Dell’Orletta, Henrik Dittmann, Alessandro Lenci, and Vito Pirrelli. 2014. The PAISÀ corpus of Italian web texts. In Proceedings of the 9th Web as Corpus Workshop (WaC-9), pages 36–43.
Francesca Masini. 2005. Multi-word expressions between syntax and the lexicon: the case of Italian verb-particle constructions. SKY Journal of Linguistics, 18(2005):145–173.
Jaume Mateu and Gemma Rigau. 2010. Verb-particle constructions in Romance: A lexical-syntactic account. Probus, 22(2):241–269.
Stefano Quaglia and Andreas Trotzke. 2017. Italian verb particles and clausal positions. In IATL 31: The 31st annual meeting Israel Association for Theoretical Linguistics, pages 67–82.
Valeria Quochi. 2007. A usage-based approach to light verb constructions in Italian: Development and use.
Eric Reuland. 1990. Reflexives and beyond: Non-local anaphora in Italian revisited. Grammar in progress: GLOW essays for Henk van Riemsdijk, 36:351.
Agata Savary, Carlos Ramisch, Silvio Cordeiro, Federico Sangati, Veronika Vincze, Behrang QasemiZadeh, Marie Candito, Fabienne Cap, Voula Giouli, Ivelina Stoyanova, et al. 2017. The PARSEME shared task on automatic identification of verbal multiword expressions. In Proceedings of the 13th Workshop on Multiword Expressions (MWE 2017), pages 31–47.
Patrizia Tabossi, Lisa Arduino, and Rachele Fanari. 2011. Descriptive norms for 245 Italian idiomatic expressions. Behavior Research Methods, 43(1):110–123.
Shiva Taslimipoor, Anna Desantis, Manuela Cherchi, Ruslan Mitkov, and Johanna Monti. 2016. Language resources for Italian: towards the development of a corpus of annotated Italian multiword expressions. In Proceedings of Third Italian Conference on Computational Linguistics (CLiC-it 2016) & Fifth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2016). CEUR-WS.
Simonetta Vietri. 2014a. The Italian module for NooJ. In Proceedings of the First Italian Conference on Computational Linguistics, CLiC-it, pages 389–393.
Simonetta Vietri. 2014b. Idiomatic Constructions in Italian: A Lexicon-grammar Approach, volume 31. John Benjamins Publishing Company.
Simonetta Vietri. 2014c. The lexicon-grammar of Italian idioms. In Workshop on Lexical and Grammatical Resources for Language Processing, COLING 2014, pages 137–146.
Andrea Zaninello and Malvina Nissim. 2010. Creation of lexical resources for a characterisation of multiword expressions in Italian. In LREC.