                          Overview of the EVALITA 2016
                    Part Of Speech on TWitter for ITAlian Task

                Cristina Bosco                           Fabio Tamburini
   Dip. di Informatica, Università di Torino    FICLIT, University of Bologna, Italy
           bosco@di.unito.it                     fabio.tamburini@unibo.it

                 Andrea Bolioli                            Alessandro Mazzei
                    CELI                        Dip. di Informatica, Università di Torino
              abolioli@celi.it                         mazzei@di.unito.it


                   Abstract

English. The increasing interest in the extraction of various forms of knowledge from micro-blogs and social media makes it crucial to develop resources and tools that can deal with them automatically. PoSTWITA contributes to the advancement of the state of the art for the Italian language by: (a) providing the community with a previously non-existing collection of data extracted from Twitter and annotated with grammatical categories, to be used as a benchmark for system evaluation; (b) supporting the adaptation of Part of Speech tagging systems to this particular text domain.

Italiano. The growing relevance of extracting various forms of knowledge from micro-blog and social media texts makes the development of tools and resources for their automatic processing crucial. PoSTWITA aims to contribute to the advancement of the state of the art for the Italian language in two ways: (a) by providing the community with a collection of data extracted from Twitter and annotated with grammatical categories, a previously non-existing resource, to be used as a benchmark in the evaluation of systems; (b) by promoting the adaptation to this particular text domain of the Part of Speech tagging systems participating in the task.

Authors' order has been decided by coin toss.

1   Introduction and motivation

In the past, the effort on Part-of-Speech (PoS) tagging has mainly focused on texts characterised by standard forms and syntax. In the last few years, however, interest in the automatic processing of social media texts, in particular from microblogging platforms such as Twitter, has grown considerably: so-called user-generated content has already been shown to be useful for a variety of applications that identify trends and upcoming events in various fields.
   As social media texts clearly differ from standardised texts, both in the nature of their lexical items and in their distributional properties (short messages, emoticons and mentions, threaded messages, etc.), Natural Language Processing methods need to be adapted to deal with them and obtain reliable processing results. The basis for such an adaptation are tagged social media text corpora (Neunerdt et al., 2013) for training and testing automatic procedures. Even if various attempts to produce such specialised resources and tools are described in the literature for other languages (e.g. (Gimpel et al., 2011; Derczynski et al., 2013; Neunerdt et al., 2013; Owoputi et al., 2013)), Italian currently lacks both.
   For all the above mentioned reasons, we proposed a task for EVALITA 2016 concerning the domain adaptation of PoS-taggers to Twitter texts. Participants in the evaluation campaign were required to use the two following data sets, provided by the organization, to set up their systems: the first one, henceforth referred to as Development Set (DS), contains data manually annotated using a specific tagset (see section 2.2 for the tagset description) and must be used to train participant systems; the second one, referred to as Test Set
(TS), contains the test data in blind format for the evaluation, and was given to participants on the date scheduled for the evaluation.
   To better focus the task on the challenges related to PoS tagging, and also to avoid the annoying problem of deleted tweets, the distributed version of the tweets had been previously tokenised, with each token on a separate line. Moreover, in an "open task" perspective, participants were allowed to use resources other than those released for the task, both for training and to enhance final performance, as long as their results conform to the proposed tagset.
   The paper is organized as follows. The next section describes the data exploited in the task, the annotation process, and the issues related to the tokenisation and tagging applied to the dataset. The following section is devoted to the description of the evaluation metrics and the participants' results. Finally, we discuss the main issues involved in PoSTWITA.

2   Data Description

For the corpus of the proposed task, we collected tweets that are part of the EVALITA 2014 SENTIment POLarity Classification (SENTIPOLC) task dataset (Basile et al., 2014), benefitting from the fact that it is cleaned of repetitions and other possible sources of noise. The SENTIPOLC corpus originates from a set of randomly collected tweets (Twita) (Basile et al., 2013) and a set of posts extracted by exploiting specific keywords and hashtags marking political topics (SentiTUT) (Bosco et al., 2013).
   With a view to developing a benchmark where a full pipeline of NLP tools can be applied and tested in the future, the same selection of tweets has been exploited in other EVALITA 2016 tasks, in particular in the EVALITA 2016 SENTiment POLarity Classification Task (SENTIPOLC) (Barbieri et al., 2016), Named Entity rEcognition and Linking in Italian Tweets (NEEL-IT) (Basile et al., 2016) and the Event Factuality Annotation Task (FactA) (Minard et al., 2016).
   Both the development and the test set of EVALITA 2016 have been manually annotated with PoS tags. The former, which has been distributed as the DS for PoSTWITA, includes 6,438 tweets (114,967 tokens); the latter, that is the TS, is instead composed of 300 tweets (4,759 tokens).
   The tokenisation and annotation of all data were first carried out by automatic tools, with a high error rate explained by the features of the domain and text genre. We adapted the Tweet-NLP tokeniser (Gimpel et al., 2011) to Italian for token segmentation, and used the TnT tagger (Brants, 2000), trained on the Universal Dependencies corpus (v1.3), for the first PoS-tagging step (see also section 2.2).
   The necessary manual correction was applied by two skilled annotators working independently on the data. The versions they produced were compared in order to detect disagreements, conflicts or residual errors, which were finally resolved with the contribution of a third annotator. Nevertheless, given that the PoSTWITA datasets were developed from scratch as far as tokenisation and the annotation of grammatical categories are concerned, we expected the possible presence of a few residual errors even after the three phases of the annotation process described above. Therefore, during the evaluation campaign, and before the date scheduled for the evaluation, all participants were invited and encouraged to communicate to the organizers any errors found in the DS. This allowed the organizers (but not the participants) to update it and redistribute it to the participants in an enhanced form.
   No lexical resource has been distributed with the PoSTWITA 2016 data, since each participant is allowed to use any available lexical resource or can freely induce one from the training data.
   All the data are provided as plain text files in UNIX format (thus attention must be paid to the newline character format), tokenised as described in section 2.1, but only those of the DS have been released with the adequate PoS tags described in section 2.2. The TS contains only the tokenised words, not the correct tags, which have to be added by the participant systems and submitted for the evaluation. The correct tokenised and tagged data of the TS (called gold standard TS), exploited for the evaluation, were provided to the participants after the end of the contest, together with their scores.
   In accordance with the treatment in the dataset from which our data are extracted, each tweet in the PoSTWITA corpus is considered as a separate entity and
we did not preserve thread integrity; thus the taggers participating in the contest have to process each tweet separately.

2.1   Tokenisation Issues

The problem of text segmentation (tokenisation) is a central issue in PoS-tagger evaluation and comparison. In principle, for practical applications, every system may apply its own tokenisation rules, leading to different outputs.
   We provided all the development and test data of the evaluation campaign in tokenised format, one token per line followed by its tag (when applicable), according to the following schema:

         ID TWEET 1           162545185920778240
            Governo PROPN
            Monti PROPN
            : PUNCT
            decreto NOUN
            in ADP
            cdm PROPN
            per ADP
            approvazione NOUN
            ! PUNCT
            http://t.co/Z76KLLGP URL

         ID TWEET 2           192902763032743936
            #Ferrara HASHTAG
            critica VERB
            #Grillo HASHTAG
            perché SCONJ
            ...

   The first line for each tweet contains the Tweet ID, while the line after the last token of each tweet is empty, in order to separate each post from the following one. The example above shows some tokenisation and formatting issues, in particular:

  • accents, which are coded using the UTF-8 encoding table;

  • the apostrophe, which is tokenised separately only when used as a quotation mark, not when it signals a removed character (as in dell'/orto).

All the other features of the data annotation are described in detail in the following parts of this section.
   As far as tokenisation and tagging principles in EVALITA 2016 PoSTWITA are concerned, we decided to follow the strategy proposed in the Universal Dependencies (UD) project for Italian1, applying only minor changes motivated by the special features of the domain addressed in the task. This makes the EVALITA 2016 PoSTWITA gold standard annotation compliant with the other UD datasets, and strongly improves the portability of our newly developed datasets towards this standard.
   Assuming, as is usual and more suitable in PoS tagging, a neutral perspective with respect to the solution of parsing problems (more relevant in building treebanks), we differentiated our format from the one applied in UD by keeping the word unsplit, rather than split into different tokens, in the two following cases:

  • articulated prepositions (e.g. dalla (from-the[fem]), nell' (in-the[masc]), al (to-the), ...);

  • clitic clusters, which can be attached to the end of a verb form (e.g. regalaglielo (gift-to-him-it), dandolo (giving-it), ...).

For this reason, we also decided to define two novel specific tags to be assigned in these cases (see Table 1): ADP_A and VERB_CLIT, respectively for articulated prepositions and clitic clusters, according to the strategy assumed in previous EVALITA PoS tagging evaluations.
   The participants are requested to return the test file using exactly the same tokenisation format, containing exactly the same number of tokens. The comparison with the reference file will be performed line-by-line, so a misalignment will produce wrong results.

2.2   Tagset

Beyond the introduction of the novel labels cited above, motivated by tokenisation issues and related to articulated prepositions and clitic clusters, further modifications of the PoS tagging labels with respect to the UD standard are motivated by the necessity of more specific labels to represent particular phenomena that often occur in social media texts. We therefore introduced new Twitter-specific tags for cases that, following the UD specifications, would all be classified into the generic SYM (symbol) class, namely emoticons, Internet addresses, email addresses, hashtags and mentions (EMO, URL, EMAIL, HASHTAG and MENTION). See Table 1 for a complete description of the PoSTWITA tagset.
   In the following we report the most challenging issues addressed in the development of our data sets, i.e. the management of proper nouns and of foreign words.

   1
     http://universaldependencies.org/it/pos/index.html
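As an illustration, the one-token-per-line exchange format described in section 2.1 can be read with a few lines of Python. This is a minimal sketch: the function name and the block-splitting strategy are ours, not part of the official distribution.

```python
from typing import List, Optional, Tuple

def read_postwita(text: str) -> List[Tuple[str, List[Tuple[str, Optional[str]]]]]:
    """Parse the one-token-per-line PoSTWITA exchange format.

    Each tweet block starts with a line holding the Tweet ID, followed by
    one "token TAG" line per token; blocks are separated by an empty line.
    """
    tweets = []
    for block in text.strip().split("\n\n"):
        lines = block.splitlines()
        tweet_id = lines[0].split()[-1]  # numeric Twitter ID on the first line
        tokens = []
        for line in lines[1:]:
            parts = line.split()
            # DS lines carry "token TAG"; blind TS lines carry only the token
            token, tag = (parts[0], parts[1]) if len(parts) == 2 else (parts[0], None)
            tokens.append((token, tag))
        tweets.append((tweet_id, tokens))
    return tweets
```

A blind TS file parses the same way, with the tag field simply absent.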
              Tagset                  Category                        Examples
        UD     PoSTWITA16                                             (if different from UD specs)
        ADJ    ADJ                    Adjective                       -
        ADP    ADP                    Adposition (simple prep.)       di, a, da, in, con, su, per
               ADP_A                  Adposition (prep. + article)    dalla, nella, sulla, dell'
        ADV    ADV                    Adverb                          -
        AUX    AUX                    Auxiliary Verb                  -
        CONJ   CONJ                   Coordinating Conjunction        -
        DET    DET                    Determiner                      -
        INTJ   INTJ                   Interjection                    -
        NOUN   NOUN                   Noun                            -
        NUM    NUM                    Numeral                         -
        PART   PART                   Particle                        -
        PRON   PRON                   Pronoun                         -
        PROPN  PROPN                  Proper Noun                     -
        PUNCT  PUNCT                  Punctuation                     -
        SCONJ  SCONJ                  Subordinating Conjunction       -
        SYM    SYM                    Symbol                          -
               EMO                    Emoticon/Emoji                  :-)  ^_^  ♥  :P
               URL                    Web Address                     http://www.somewhere.it
               EMAIL                  Email Address                   someone@somewhere.com
               HASHTAG                Hashtag                         #staisereno
               MENTION                Mention                         @someone
        VERB   VERB                   Verb                            -
               VERB_CLIT              Verb + Clitic pronoun cluster   mangiarlo, donarglielo
        X      X                      Other or RT/rt                  -

                              Table 1: EVALITA2016 - PoSTWITA tagset.
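For illustration only, the Twitter-specific classes of Table 1 are largely recognisable from surface form alone. The patterns below are our own deliberate simplification (real emoticon inventories, for instance, are far larger) and not the annotation tools actually used for the task:

```python
import re

# Illustrative surface patterns for the Twitter-specific tags of Table 1
# (our own simplification; order matters, e.g. EMAIL is tried before MENTION).
TWITTER_TAG_PATTERNS = [
    ("URL",     re.compile(r"https?://\S+$")),
    ("EMAIL",   re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+$")),
    ("MENTION", re.compile(r"@\w+$")),
    ("HASHTAG", re.compile(r"#\w+$")),
    ("EMO",     re.compile(r"[:;=8][-^']?[)(DPpOo/\\|]$")),
]

def twitter_tag(token: str):
    """Return the Twitter-specific tag for a token, or None if the token
    belongs to an ordinary UD class."""
    for tag, pattern in TWITTER_TAG_PATTERNS:
        if pattern.match(token):
            return tag
    return None
```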


2.2.1 Proper Noun Management
The annotation of named entities (NE) poses a number of relevant problems in tokenisation and PoS tagging. The most coherent way to handle this kind of phenomenon is to consider each NE as a single token, assigning it the PROPN tag. Unfortunately this is not a viable solution for this evaluation task and, moreover, a lot of useful generalisations on n-gram sequences (e.g. Ministero/dell/Interno PROPN/ADP_A/PROPN) would be lost by adopting such a solution. In any case, the annotation of sequences like Banca Popolare and Presidente della Repubblica Italiana deserves some attention and a clear policy.
   Following the approach applied in EVALITA 2007 for the PoS tagging task, we annotate as PROPN those words of the NE which are marked by an uppercase letter, as in the following examples:

 Banca PROPN       Presidente PROPN    Ordine PROPN
 Popolare PROPN    della ADP_A         dei ADP_A
                   Repubblica PROPN    Medici PROPN
                   Italiana PROPN

Nevertheless, in some other cases the uppercase letter has not been considered enough to determine the introduction of a PROPN tag: "...anche nei Paesi dove...", "...in contraddizione con lo Stato sociale...". This strategy aims to produce a data set that incorporates the speakers' linguistic intuition about this kind of structure, regardless of the possibility of formalising the involved knowledge in automatic processing.

2.2.2   Foreign words
Non-Italian words are annotated, when possible, following the same PoS tagging criteria adopted in the UD guidelines for the language in question. For instance, good-bye is marked as an interjection with the label INTJ.

3   Evaluation Metrics

The evaluation is performed in a black box approach: only the systems' output is evaluated. The evaluation metric is based on a token-by-
    Team ID         Team                         Affiliations
    EURAC           E.W. Stemle                  Inst. for Specialised Commun. and Multilingualism,
                                                 EURAC Research, Bolzano/Bozen, Italy
    ILABS           C. Aliprandi, L. De Mattei   Integris Srl, Roma, Italy
    ILC-CNR         A. Cimino, F. Dell'Orletta   Istituto di Linguistica Computazionale Antonio Zampolli,
                                                 CNR, Pisa, Italy
    MIVOQ           Giulio Paci                  Mivoq Srl, Padova, Italy
    NITMZ           P. Pakray, G. Majumder       Dept. of Computer Science & Engg., Nat. Inst. of Tech.
                                                 Mizoram, Aizawl, India
    UniBologna      F. Tamburini                 FICLIT, University of Bologna, Italy
    UniDuisburg     T. Horsmann, T. Zesch        Language Technology Lab, Dept. of Comp. Science and
                                                 Appl. Cog. Science, Univ. of Duisburg-Essen, Germany
    UniGroningen    B. Plank, M. Nissim          University of Groningen, The Netherlands
    UniPisa         G. Attardi, M. Simi          Dipartimento di Informatica, Università di Pisa, Italy

                   Table 2: Teams participating in the EVALITA2016 - PoSTWITA task.


token comparison, and only a single tag is allowed for each token. The metric considered is tagging accuracy: it is defined as the number of correct PoS tag assignments divided by the total number of tokens in the TS.

4   Teams and Results

16 teams registered for this task, but only 9 submitted a final run for the evaluation. Table 2 outlines the participants' main data: 7 participating teams belong to universities or other research centres, and the remaining 2 represent private companies working in the NLP and speech processing fields.
   Table 3 describes the main features of the evaluated systems w.r.t. the core methods and the additional resources employed to develop the presented systems.
   In Table 4 we report the final results of the PoSTWITA task of the EVALITA 2016 evaluation campaign. Each participant was allowed to submit a single "official" result and, optionally, one "unofficial" result ("UnOFF" in the table): UniBologna, UniGroningen, UniPisa and UniDuisburg decided to submit one more unofficial result. The best result was achieved by the ILC-CNR group (93.19%, corresponding to 4,435 correct tokens out of 4,759).

5   Discussion and Conclusions

Looking at the results, we can draw some provisional conclusions about the PoS-tagging of Italian tweets:

   • as expected, the performances of automatic PoS-taggers when annotating tweets are lower than when working on normal texts, but they are in line with the state of the art for other languages;

   • all the top-performing systems are based on Deep Neural Networks and, in particular, on Long Short-Term Memories (LSTM) (Hochreiter, Schmidhuber, 1997; Graves, Schmidhuber, 2005);

   • most systems use word or character embeddings as input;

   • more or less all the presented systems make use of additional resources or knowledge (a morphological analyser, additional tagged corpora and/or large non-annotated Twitter corpora).

   Looking at the official results, and comparing them with the experiments that the participants devised to set up their own systems (not reported here; please see the participants' reports), it is possible to note a large difference in performance. During the setup phase, most of the top-performing systems obtained coherent results well above 95-96% accuracy on the development set (either by splitting it into a training/validation pair or by cross-validation), while the best performing system in the official evaluation exhibits a performance slightly above 93%. This is a huge difference for this kind of task, rarely observed in the literature.
   One possible reason that could explain this difference in performance regards the kind of docu-
 Team ID          Core methods                              Resources (other than DS)
 EURAC            LSTM NN                                   DiDi-IT
                  (word & char embeddings)
 ILABS            Perceptron algorithm                      word features extracted from proprietary
                                                            resources and 250k entries of Wiktionary
 ILC-CNR          two-branch BiLSTM NN                      Morphological Analyser (65,500 lemmas) +
                  (word & char embeddings)                  itWaC corpus
 MIVOQ            Tagger combination based on Yamcha        Evalita 2009 PoS-tagged data +
                                                            ISTC pronunciation dictionary
 NITMZ            HMM bigram model                          -
 UniBologna       Stacked BiLSTM NN + CRF                   Morphological Analyser (110,000 lemmas) +
                  (augmented word embeddings)               200Mw Twitter corpus
 UniDuisburg      CRF classifier                            400Mw Twitter corpus
 UniGroningen     BiLSTM NN                                 Universal Dependencies v1.3 +
                  (word embeddings)                         74kw tagged Facebook corpus
 UniPisa          BiLSTM NN + CRF                           423Kw tagged Mixed corpus +
                  (word & char embeddings)                  141Mw Twitter corpus

                                     Table 3: Systems description.
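The token-by-token scoring described in section 3 reduces to a line-by-line comparison of the system file against the gold standard TS. A minimal sketch, assuming the aligned one-token-per-line format of section 2.1 (the function name is ours; the official scorer may differ in details such as error reporting):

```python
def tagging_accuracy(gold_lines, system_lines):
    """Token-by-token tagging accuracy: correct tag assignments / total tokens.

    Both inputs are tokenised, tagged files as lists of lines; Tweet ID and
    blank separator lines are skipped. The two files must stay aligned,
    since the comparison is positional.
    """
    def tags(lines):
        for line in lines:
            parts = line.split()
            if len(parts) == 2:  # keep only "token TAG" lines
                yield parts[1]
    gold, system = list(tags(gold_lines)), list(tags(system_lines))
    assert len(gold) == len(system), "misaligned files produce wrong results"
    correct = sum(g == s for g, s in zip(gold, system))
    return correct / len(gold)
```

Under this definition, the winning run's 4,435 correct tokens out of 4,759 give the reported 0.9319.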

  #    Team ID                    Tagging
                                  Accuracy
  1    ILC-CNR                    0.9319 (4435)
  2    UniDuisburg                0.9286 (4419)
  3    UniBologna UnOFF           0.9279 (4416)
  4    MIVOQ                      0.9271 (4412)
  5    UniBologna                 0.9246 (4400)
  6    UniGroningen               0.9225 (4390)
  7    UniGroningen UnOFF         0.9185 (4371)
  8    UniPisa                    0.9157 (4358)
  9    UniPisa UnOFF              0.9153 (4356)
  10   ILABS                      0.8790 (4183)
  11   NITMZ                      0.8596 (4091)
  12   UniDuisburg UnOFF          0.8178 (3892)
  13   EURAC                      0.7600 (3617)

Table 4: EVALITA2016 - PoSTWITA participants' results with respect to Tagging Accuracy. "UnOFF" marks unofficial results.

ments in the test set. We inherited the development set from the SENTIPOLC task at EVALITA 2014 and the test set from SENTIPOLC 2016 and, possibly, the two corpora, developed in different periods and according to different criteria, contain different kinds of documents. Differences in lexicon, genre, etc. could have affected the training phase of the taggers, leading to lower results in the evaluation phase.

References

Barbieri, F., Basile, V., Croce, D., Nissim, M., Novielli, N., Patti, V. 2016. Overview of the EVALITA 2016 SENTiment POLarity Classification Task. In Proceedings of Third Italian Conference on Computational Linguistics (CLiC-it 2016) & Fifth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2016).

Basile, V., Bolioli, A., Nissim, M., Patti, V., Rosso, P. 2014. Overview of the Evalita 2014 SENTIment POLarity Classification Task. In Proceedings of Evalita 2014, 50–57.

Basile, P., Caputo, A., Gentile, A.L., Rizzo, G. 2016. Overview of the EVALITA 2016 Named Entity rEcognition and Linking in Italian Tweets (NEEL-IT) Task. In Proceedings of Third Italian Conference on Computational Linguistics (CLiC-it 2016) & Fifth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2016).

Basile, V., Nissim, M. 2013. Sentiment analysis on Italian tweets. In Proceedings of the 4th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis.

Bosco, C., Patti, V., Bolioli, A. 2013. Developing Corpora for Sentiment Analysis: The Case of Irony and SentiTUT. IEEE Intelligent Systems, special issue on Knowledge-based approaches to content-level sentiment analysis, 28(2).

Brants, T. 2000. TnT – A Statistical Part-of-Speech Tagger. In Proceedings of the 6th Applied Natural Language Processing Conference.
                                                        Language Processing Conference.
Derczynski, L., Ritter, A., Clark, S., Bontcheva, K.
  2013. Twitter Part-of-Speech Tagging for All: Over-
  coming Sparse and Noisy Data. In Proceedings of
  RANLP 2013, 198–206.
Gimpel, K., Schneider, N., O’Connor, B., Das, D.,
  Mills, D., Eisenstein, J., Heilman, M., Yogatama,
  D., Flanigan, J., Smith, N.A. 2011. Part-of-Speech
  Tagging for Twitter: Annotation, Features, and Ex-
  periments. In Proceedings of ACL 2011.
Graves, A., Schmidhuber, J.   2005.    Framewise
  phoneme classification with bidirectional LSTM and
  other neural network architectures. Neural Net-
  works, 18(5-6), 602–610.
Hochreiter, S., Schmidhuber, J. 1997. Long short-term
  memory. Neural Computation, 9(8), 1735–1780.
Minard, A.L., Speranza, M., Caselli, T. 2016. The
  EVALITA 2016 Event Factuality Annotation Task
  (FactA). In Proceedings of Third Italian Confer-
  ence on Computational Linguistics (CLiC-it 2016)
  & Fifth Evaluation Campaign of Natural Language
  Processing and Speech Tools for Italian. Final Work-
  shop (EVALITA 2016).
Neunerdt, M., Trevisan, B., Reyer, M., Mathar, R.
  2013. Part-of-speech tagging for social media texts.
  Language Processing and Knowledge in the Web.
  Springer, 139–150.
Owoputi, O., O'Connor, B., Dyer, C., Gimpel, K.,
  Schneider, N., Smith, N.A. 2013. Improved Part-of-
  Speech Tagging for Online Conversational Text with
  Word Clusters. In Proceedings of NAACL 2013.