=Paper=
{{Paper
|id=Vol-1749/paper_013
|storemode=property
|title=Overview of the EVALITA 2016 Part Of Speech on TWitter for ITAlian Task
|pdfUrl=https://ceur-ws.org/Vol-1749/paper_013.pdf
|volume=Vol-1749
|authors=Cristina Bosco,Fabio Tamburini,Andrea Bolioli,Alessandro Mazzei
|dblpUrl=https://dblp.org/rec/conf/clic-it/BoscoTBM16
}}
==Overview of the EVALITA 2016 Part Of Speech on TWitter for ITAlian Task==
Cristina Bosco, Dip. di Informatica, Università di Torino, bosco@di.unito.it

Fabio Tamburini, FICLIT, University of Bologna, Italy, fabio.tamburini@unibo.it

Andrea Bolioli, CELI, abolioli@celi.it

Alessandro Mazzei, Dip. di Informatica, Università di Torino, mazzei@di.unito.it
===Abstract===
English. The increasing interest in the extraction of various forms of knowledge from micro-blogs and social media makes it crucial to develop resources and tools that can deal with them automatically. PoSTWITA contributes to the advancement of the state of the art for the Italian language by: (a) enriching the community with a previously non-existent collection of data extracted from Twitter and annotated with grammatical categories, to be used as a benchmark for system evaluation; (b) supporting the adaptation of Part of Speech tagging systems to this particular text domain.

Italiano. La crescente rilevanza dell’estrazione di varie forme di conoscenza da testi derivanti da micro-blog e social media rende cruciale lo sviluppo di strumenti e risorse per il trattamento automatico. PoSTWITA si propone di contribuire all’avanzamento dello stato dell’arte per la lingua italiana in due modi: (a) fornendo alla comunità una collezione di dati estratti da Twitter ed annotati con le categorie grammaticali, risorsa precedentemente non esistente, da utilizzare come banco di prova nella valutazione di sistemi; (b) promuovendo l’adattamento a questo particolare dominio testuale dei sistemi di Part of Speech tagging che partecipano al task.

Author order was decided by coin toss.

===1 Introduction and motivation===
In the past, the effort on Part-of-Speech (PoS) tagging has mainly focused on texts characterised by standard forms and syntax. However, in the last few years the interest in automatic evaluation of social media texts, in particular from microblogging platforms such as Twitter, has grown considerably: so-called user-generated content has already been shown to be useful for a variety of applications that identify trends and upcoming events in various fields.

As social media texts are clearly different from standardized texts, both regarding the nature of lexical items and their distributional properties (short messages, emoticons and mentions, threaded messages, etc.), Natural Language Processing methods need to be adapted to deal with them and obtain reliable results in processing. The basis for such an adaptation is tagged social media text corpora (Neunerdt et al., 2013) for training and testing automatic procedures. Even if various attempts to produce this kind of specialised resources and tools are described in the literature for other languages (e.g. Gimpel et al., 2011; Derczynski et al., 2013; Neunerdt et al., 2013; Owoputi et al., 2013), Italian currently lacks both completely.

For all the above-mentioned reasons, we proposed a task for EVALITA 2016 concerning the domain adaptation of PoS-taggers to Twitter texts. Participants in the evaluation campaign were required to use the two following data sets provided by the organization to set up their systems: the first one, henceforth referred to as Development Set (DS), contains data manually annotated using a specific tagset (see section 2.2 for the tagset description) and must be used to train participants' systems; the second one, referred to as Test Set
(TS), contains the test data in blind format for the evaluation and was given to participants on the date scheduled for the evaluation.

To better focus the task on the challenges related to PoS tagging, but also to avoid the tedious problem of disappeared tweets, the distributed version of the tweets was tokenised in advance, with each token placed on a separate line.

Moreover, according to an “open task” perspective, participants were allowed to use resources other than those released for the task, both for training and to enhance final performance, as long as their results apply the proposed tagsets.

The paper is organized as follows. The next section describes the data exploited in the task, the annotation process and the issues related to the tokenisation and tagging applied to the dataset. The following section is devoted to the description of the evaluation metrics and the participants' results. Finally, we discuss the main issues involved in PoSTWITA.

===2 Data Description===
For the corpus of the proposed task, we collected tweets that are part of the EVALITA2014 SENTIment POLarity Classification (SENTIPOLC) (Basile et al., 2014) task dataset, benefiting from the fact that it has been cleaned of repetitions and other possible sources of noise. The SENTIPOLC corpus originates from a set of randomly collected tweets (Twita) (Basile et al., 2013), and a set of posts extracted using specific keywords and hashtags marking political topics (SentiTUT) (Bosco et al., 2013).

In order to work towards the development of a benchmark where a full pipeline of NLP tools can be applied and tested in the future, the same selection of tweets has been exploited in other EVALITA2016 tasks, in particular in the EVALITA 2016 SENTiment POLarity Classification Task (SENTIPOLC) (Barbieri et al., 2016), Named Entity rEcognition and Linking in Italian Tweets (NEEL-IT) (Basile et al., 2016) and Event Factuality Annotation Task (FactA) (Minard et al., 2016).

Both the development and test sets of EVALITA2016 have been manually annotated with PoS tags. The former, which has been distributed as the DS for PoSTWITA, includes 6,438 tweets (114,967 tokens). The latter, that is the TS, is instead composed of 300 tweets (4,759 tokens).

The tokenisation and annotation of all data were first carried out by automatic tools, with a high error rate that is explained by the features of the domain and text genre. We adapted the Tweet-NLP tokeniser (Gimpel et al., 2011) to Italian for token segmentation and used the TnT tagger (Brants, 2000) trained on the Universal Dependencies corpus (v1.3) for the first PoS-tagging step (see also section 2.2).

The necessary manual correction was applied by two skilled human annotators working independently on the data. The versions they produced were compared in order to detect disagreements, conflicts or residual errors, which were finally resolved by a third annotator.

Nevertheless, since the PoSTWITA datasets were developed from scratch as far as tokenisation and annotation of grammatical categories are concerned, we expected a few residual errors to remain even after the three phases of the annotation process described above. Therefore, during the evaluation campaign, and before the date scheduled for the evaluation, all participants were invited and encouraged to report to the organizers any errors found in the DS. This allowed the organizers (but not the participants) to update the DS and redistribute it to the participants in an enhanced form.

No lexical resource has been distributed with PoSTWITA 2016 data, since each participant is allowed to use any available lexical resource or can freely induce one from the training data.

All the data are provided as plain text files in UNIX format (thus attention must be paid to the newline character format), tokenised as described in section 2.1, but only the DS has been released with the PoS tags described in section 2.2. The TS contains only the tokenised words without the correct tags, which have to be added by the participant systems and submitted for the evaluation. The correct tokenised and tagged data of the TS (called gold standard TS), used for the evaluation, was provided to the participants after the end of the contest, together with their score.

In line with the treatment in the dataset from which our data are extracted, each tweet in the PoSTWITA corpus is considered as a separate entity and
we did not preserve thread integrity, thus taggers participating in the contest have to process each tweet separately.

====2.1 Tokenisation Issues====
The problem of text segmentation (tokenisation) is a central issue in PoS-tagger evaluation and comparison. In principle, for practical applications, each system may apply its own tokenisation rules, leading to different outputs.

We provided all the development and test data for the evaluation campaign in tokenised format, one token per line followed by its tag (when applicable), following the schema:

 ID TWEET 1 162545185920778240
 Governo PROPN
 Monti PROPN
 : PUNCT
 decreto NOUN
 in ADP
 cdm PROPN
 per ADP
 approvazione NOUN
 ! PUNCT
 http://t.co/Z76KLLGP URL
 
 ID TWEET 2 192902763032743936
 #Ferrara HASHTAG
 critica VERB
 #Grillo HASHTAG
 perché SCONJ
 ...

The first line of each tweet contains the Tweet ID, while the line after the last token of each tweet is empty, in order to separate each post from the following one. The example above shows some tokenisation and formatting issues, in particular:

* accents, which are coded using the UTF-8 encoding table;
* apostrophes, which are tokenised separately only when used as quotation marks, not when signalling a removed character (as in dell’/orto).

All the other features of data annotation are described in detail in the following parts of this section.

As regards tokenisation and tagging principles in EVALITA2016 PoSTWITA, we decided to follow the strategy proposed in the Universal Dependencies (UD) project for Italian (http://universaldependencies.org/it/pos/index.html), applying only minor changes, which are motivated by the special features of the domain addressed in the task. This makes the EVALITA2016-PoSTWITA gold standard annotation compliant with the other UD datasets, and strongly improves the portability of our newly developed datasets towards this standard.

Assuming, as is usual and more suitable in PoS tagging, a neutral perspective with respect to the solution of parsing problems (more relevant in building treebanks), we differentiated our format from the one applied in UD by keeping the word unsplit, rather than split into different tokens, in the two following cases:

* articulated prepositions (e.g. dalla (from-the[fem]), nell’ (in-the[masc]), al (to-the), ...);
* clitic clusters, which can be attached to the end of a verb form (e.g. regalaglielo (gift-to-him-it), dandolo (giving-it), ...).

For this reason, we also decided to define two novel specific tags to be assigned in these cases (see Table 1): ADP_A and VERB_CLIT, respectively for articulated prepositions and clitics, according to the strategy adopted in previous EVALITA PoS tagging evaluations.

The participants are requested to return the test file using exactly the same tokenisation format, containing exactly the same number of tokens. The comparison with the reference file will be performed line by line, thus a misalignment will produce wrong results.

====2.2 Tagset====
Beyond the introduction of the novel labels cited above, motivated by tokenisation issues and related to articulated prepositions and clitic clusters, further modifications of the PoS tagging labels with respect to the UD standard are motivated by the need for more specific labels to represent particular phenomena that often occur in social media texts. We therefore introduced new Twitter-specific tags for cases that, following the UD specifications, should all be classified into the generic SYM (symbol) class, namely emoticons, Internet addresses, email addresses, hashtags and mentions (EMO, URL, EMAIL, HASHTAG and MENTION). See Table 1 for a complete description of the PoSTWITA tagset.

In the following we report the most challenging issues addressed in the development of our data sets, i.e. the management of proper nouns and of foreign words.
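The one-token-per-line layout of section 2.1 and the line-by-line comparison with the reference file can be sketched as follows. This is a minimal illustration, not the official evaluation script: the function names `read_postwita` and `tagging_accuracy` are hypothetical, the separator between token and tag is assumed to be whitespace, and the tweet header line is used verbatim as a key. The sample data is the example from section 2.1 (the second tweet is truncated there, so only its first four tokens appear).

```python
from typing import Dict, List, Tuple

Token = Tuple[str, str]  # (word form, PoS tag); the tag is "" when absent (blind test set)


def read_postwita(text: str) -> Dict[str, List[Token]]:
    """Parse the one-token-per-line layout into {tweet header: [(form, tag), ...]}.

    Assumptions: each tweet starts with a header line, is followed by one
    "form tag" line per token, and is closed by an empty line.
    """
    tweets: Dict[str, List[Token]] = {}
    header = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:              # an empty line closes the current tweet
            header = None
            continue
        if header is None:        # first non-empty line of a block is the header
            header = line
            tweets[header] = []
            continue
        parts = line.split()
        tag = parts[1] if len(parts) > 1 else ""
        tweets[header].append((parts[0], tag))
    return tweets


def tagging_accuracy(gold: Dict[str, List[Token]],
                     pred: Dict[str, List[Token]]) -> float:
    """Correct tag assignments divided by the total number of gold tokens.

    Design choice of this sketch: a misaligned submission raises an error
    instead of silently producing a wrong score.
    """
    correct = 0
    total = 0
    for header, gold_tokens in gold.items():
        pred_tokens = pred.get(header, [])
        if len(pred_tokens) != len(gold_tokens):
            raise ValueError(f"token count mismatch in tweet {header!r}")
        for (g_form, g_tag), (p_form, p_tag) in zip(gold_tokens, pred_tokens):
            if g_form != p_form:
                raise ValueError(f"token mismatch in tweet {header!r}: {g_form!r} vs {p_form!r}")
            total += 1
            if g_tag == p_tag:
                correct += 1
    return correct / total if total else 0.0


GOLD_TEXT = """ID TWEET 1 162545185920778240
Governo PROPN
Monti PROPN
: PUNCT
decreto NOUN
in ADP
cdm PROPN
per ADP
approvazione NOUN
! PUNCT
http://t.co/Z76KLLGP URL

ID TWEET 2 192902763032743936
#Ferrara HASHTAG
critica VERB
#Grillo HASHTAG
perché SCONJ
"""

# Hypothetical system output: identical to the gold data except for one tag.
PRED_TEXT = GOLD_TEXT.replace("cdm PROPN", "cdm NOUN")

gold = read_postwita(GOLD_TEXT)
pred = read_postwita(PRED_TEXT)
print(f"accuracy: {tagging_accuracy(gold, pred):.4f}")
```

Raising on a length or token mismatch is only one possible policy. It is chosen here because the text warns that a misaligned file yields wrong results, and an explicit error is easier to diagnose than a low score.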
{| class="wikitable"
! UD !! PoSTWITA16 !! Category !! Examples (if different from UD specs)
|-
| ADJ || ADJ || Adjective || -
|-
| ADP || ADP || Adposition (simple prep.) || di, a, da, in, con, su, per
|-
|  || ADP_A || Adposition (prep.+Article) || dalla, nella, sulla, dell
|-
| ADV || ADV || Adverb || -
|-
| AUX || AUX || Auxiliary Verb || -
|-
| CONJ || CONJ || Coordinating Conjunction || -
|-
| DET || DET || Determiner || -
|-
| INTJ || INTJ || Interjection || -
|-
| NOUN || NOUN || Noun || -
|-
| NUM || NUM || Numeral || -
|-
| PART || PART || Particle || -
|-
| PRON || PRON || Pronoun || -
|-
| PROPN || PROPN || Proper Noun || -
|-
| PUNCT || PUNCT || Punctuation || -
|-
| SCONJ || SCONJ || Subordinating Conjunction || -
|-
| SYM || SYM || Symbol || -
|-
|  || EMO || Emoticon/Emoji || :-) ^_^ ♥ :P
|-
|  || URL || Web Address || http://www.somewhere.it
|-
|  || EMAIL || Email Address || someone@somewhere.com
|-
|  || HASHTAG || Hashtag || #staisereno
|-
|  || MENTION || Mention || @someone
|-
| VERB || VERB || Verb || -
|-
|  || VERB_CLIT || Verb + Clitic pronoun cluster || mangiarlo, donarglielo
|-
| X || X || Other or RT/rt || -
|}
Table 1: EVALITA2016 - PoSTWITA tagset.
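The grouping in Table 1 suggests a simple way to project PoSTWITA tags back onto the coarser UD inventory, for instance when comparing against a tagger trained on UD data. The sketch below is an assumption-laden illustration and not part of the task definition: the names `POSTWITA_TO_UD` and `to_ud` are hypothetical, the two compound tags are written with an underscore, and mapping ADP_A to ADP and VERB_CLIT to VERB is a lossy approximation, since UD would split those words into several tokens.

```python
# Hypothetical projection of the PoSTWITA-specific tags onto the UD tag
# under which Table 1 groups them. All other tags are shared with UD.
POSTWITA_TO_UD = {
    "ADP_A": "ADP",       # articulated preposition kept as one token (lossy)
    "VERB_CLIT": "VERB",  # verb + clitic cluster kept as one token (lossy)
    "EMO": "SYM",
    "URL": "SYM",
    "EMAIL": "SYM",
    "HASHTAG": "SYM",
    "MENTION": "SYM",
}


def to_ud(tag: str) -> str:
    """Return the UD tag for a PoSTWITA tag; shared tags pass through unchanged."""
    return POSTWITA_TO_UD.get(tag, tag)


tagged = [("#Ferrara", "HASHTAG"), ("critica", "VERB"), ("dalla", "ADP_A")]
print([(form, to_ud(tag)) for form, tag in tagged])
```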
=====2.2.1 Proper Noun Management=====
The annotation of named entities (NE) poses a number of relevant problems in tokenisation and PoS tagging. The most coherent way to handle this kind of phenomenon is to consider each NE as a single token and assign it the PROPN tag. Unfortunately this is not a viable solution for this evaluation task; moreover, a lot of useful generalisation on n-gram sequences (e.g. Ministero/dell/Interno PROPN/ADP_A/PROPN) would be lost by adopting such a solution. In any case, the annotation of sequences like Banca Popolare and Presidente della Repubblica Italiana deserves some attention and a clear policy.

Following the approach applied in Evalita 2007 for the PoS tagging task, we annotate as PROPN those words of the NE that begin with an uppercase letter, as in the following examples:

 Banca PROPN       Presidente PROPN    Ordine PROPN
 Popolare PROPN    della ADP_A         dei ADP_A
                   Repubblica PROPN    Medici PROPN
                   Italiana PROPN

Nevertheless, in some other cases, the uppercase letter has not been considered sufficient to justify a PROPN tag: “...anche nei Paesi dove...”, “...in contraddizione con lo Stato sociale...”.

This strategy aims to produce a data set that incorporates the speakers' linguistic intuition about this kind of structure, regardless of whether the knowledge involved can be formalized for automatic processing.

=====2.2.2 Foreign words=====
Non-Italian words are annotated, when possible, following the same PoS tagging criteria adopted in the UD guidelines for the language they belong to. For instance, good-bye is marked as an interjection with the label INTJ.

===3 Evaluation Metrics===
The evaluation is performed with a black-box approach: only the systems' output is evaluated. The evaluation metric is based on a token-by-token comparison and only a single tag is allowed for each token. The metric considered is Tagging accuracy, defined as the number of correct PoS tag assignments divided by the total number of tokens in the TS.

===4 Teams and Results===
16 teams registered for this task, but only 9 submitted a final run for the evaluation. Table 2 outlines the participants' main data: 7 participant teams belong to universities or other research centres and the remaining 2 represent private companies working in the NLP and speech processing fields.

{| class="wikitable"
! Team ID !! Team !! Affiliations
|-
| EURAC || E.W. Stemle || Inst. for Specialised Commun. and Multilingualism, EURAC Research, Bolzano/Bozen, Italy
|-
| ILABS || C. Aliprandi, L De Mattei || Integris Srl, Roma, Italy
|-
| ILC-CNR || A. Cimino, F. Dell’Orletta || Istituto di Linguistica Computazionale Antonio Zampolli, CNR, Pisa, Italy
|-
| MIVOQ || Giulio Paci || Mivoq Srl, Padova, Italy
|-
| NITMZ || P. Pakray, G. Majumder || Deptt. of Computer Science & Engg., Nat. Inst. of Tech., Mizoram, Aizawl, India
|-
| UniBologna || F. Tamburini || FICLIT, University of Bologna, Italy
|-
| UniDuisburg || T. Horsmann, T. Zesch || Language Technology Lab, Dept. of Comp. Science and Appl. Cog. Science, Univ. of Duisburg-Essen, Germany
|-
| UniGroningen || B. Plank, M. Nissim || University of Groningen, The Netherlands
|-
| UniPisa || G. Attardi, M. Simi || Dipartimento di Informatica, Università di Pisa, Italy
|}
Table 2: Teams participating at the EVALITA2016 - PoSTWITA task.

Table 3 describes the main features of the evaluated systems with respect to the core methods and the additional resources employed to develop each system.

{| class="wikitable"
! Team ID !! Core methods !! Resources (other than DS)
|-
| EURAC || LSTM NN (word&char embeddings) || DiDi-IT
|-
| ILABS || Perceptron algorithm || word features extracted from proprietary resources and 250k entries of Wiktionary
|-
| ILC-CNR || two-branch BiLSTM NN (word&char embeddings) || Morphological Analyser (65,500 lemmas) + ItWaC corpus
|-
| MIVOQ || Tagger combination based on Yamcha || Evalita2009 PoS-tagged data; ISTC pronunciation dictionary
|-
| NITMZ || HMM bigram model || -
|-
| UniBologna || Stacked BiLSTM NN + CRF (augmented word embeddings) || Morphological Analyser (110,000 lemmas) + 200Mw Twitter corpus
|-
| UniDuisburg || CRF classifier || 400Mw Twitter corpus
|-
| UniGroningen || BiLSTM NN (word embedding) || Universal Dependencies v1.3; 74 kw tagged Facebook corpus
|-
| UniPisa || BiLSTM NN + CRF (word&char embeddings) || 423Kw tagged Mixed corpus; 141Mw Twitter corpus
|}
Table 3: Systems description.

In Table 4 we report the final results of the PoSTWITA task of the EVALITA2016 evaluation campaign. Each team was allowed to submit a single “official” result and, optionally, one “unofficial” result (“UnOFF” in the table): UniBologna, UniGroningen, UniPisa and UniDuisburg decided to submit an additional unofficial result. The best result was achieved by the ILC-CNR group (93.19%, corresponding to 4,435 correct tokens out of 4,759).

{| class="wikitable"
! # !! Team ID !! Tagging Accuracy
|-
| 1 || ILC-CNR || 0.9319 (4435)
|-
| 2 || UniDuisburg || 0.9286 (4419)
|-
| 3 || UniBologna UnOFF || 0.9279 (4416)
|-
| 4 || MIVOQ || 0.9271 (4412)
|-
| 5 || UniBologna || 0.9246 (4400)
|-
| 6 || UniGroningen || 0.9225 (4390)
|-
| 7 || UniGroningen UnOFF || 0.9185 (4371)
|-
| 8 || UniPisa || 0.9157 (4358)
|-
| 9 || UniPisa UnOFF || 0.9153 (4356)
|-
| 10 || ILABS || 0.8790 (4183)
|-
| 11 || NITMZ || 0.8596 (4091)
|-
| 12 || UniDuisburg UnOFF || 0.8178 (3892)
|-
| 13 || EURAC || 0.7600 (3617)
|}
Table 4: EVALITA2016 - PoSTWITA participants’ results with respect to Tagging Accuracy. “UnOFF” marks unofficial results.

===5 Discussion and Conclusions===
Looking at the results we can draw some provisional conclusions about the PoS-tagging of Italian tweets:

* as expected, the performance of automatic PoS-taggers when annotating tweets is lower than when working on normal texts, but is in line with the state of the art for other languages;
* all the top-performing systems are based on Deep Neural Networks and, in particular, on Long Short-Term Memories (LSTM) (Hochreiter, Schmidhuber, 1997; Graves, Schmidhuber, 2005);
* most systems use word or character embeddings as inputs;
* almost all the presented systems make use of additional resources or knowledge (morphological analyser, additional tagged corpora and/or large non-annotated Twitter corpora).

Comparing the official results with the experiments that the participants carried out to set up their own systems (not reported here; see the participants’ reports), a large difference in performance can be noted. During the setup phase most of the top-performing systems obtained consistent results well above 95-96% accuracy on the development set (either by splitting it into a training/validation pair or by running cross-validation tests), while the best performing system in the official evaluation exhibits performance slightly above 93%. This is a huge difference for this kind of task, rarely observed in the literature.

One possible reason for this difference in performance concerns the kind of documents in the test set. We inherited the development set from the SENTIPOLC task at EVALITA2014 and the test set from SENTIPOLC2016, and the two corpora, developed at different times and using different criteria, may also contain different kinds of documents. Differences in lexicon, genre, etc. could have affected the training phase of the taggers, leading to lower results in the evaluation phase.

===References===
Barbieri, F., Basile, V., Croce, D., Nissim, M., Novielli, N., Patti, V. 2016. Overview of the EVALITA 2016 SENTiment POLarity Classification Task. In Proceedings of Third Italian Conference on Computational Linguistics (CLiC-it 2016) & Fifth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2016).

Basile, V., Bolioli, A., Nissim, M., Patti, V., Rosso, P. 2014. Overview of the Evalita 2014 SENTIment POLarity Classification Task. In Proceedings of Evalita 2014, 50–57.

Basile, P., Caputo, A., Gentile, A.L., Rizzo, G. 2016. Overview of the EVALITA 2016 Named Entity rEcognition and Linking in Italian Tweets (NEEL-IT) Task. In Proceedings of Third Italian Conference on Computational Linguistics (CLiC-it 2016) & Fifth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2016).

Basile, V., Nissim, M. 2013. Sentiment analysis on Italian tweets. In Proceedings of the 4th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis.

Bosco, C., Patti, V., Bolioli, A. 2013. Developing Corpora for Sentiment Analysis: The Case of Irony and Senti-TUT. IEEE Intelligent Systems, special issue on Knowledge-based approaches to content-level sentiment analysis. Vol 28 num 2.

Brants, T. 2000. TnT – A Statistical Part-of-Speech Tagger. In Proceedings of the 6th Applied Natural Language Processing Conference.

Derczynski, L., Ritter, A., Clark, S., Bontcheva, K. 2013. Twitter Part-of-Speech Tagging for All: Overcoming Sparse and Noisy Data. In Proceedings of RANLP 2013, 198–206.

Gimpel, K., Schneider, N., O’Connor, B., Das, D., Mills, D., Eisenstein, J., Heilman, M., Yogatama, D., Flanigan, J., Smith, N.A. 2011. Part-of-Speech Tagging for Twitter: Annotation, Features, and Experiments. In Proceedings of ACL 2011.

Graves, A., Schmidhuber, J. 2005. Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks, 18(5-6), 602–610.

Hochreiter, S., Schmidhuber, J. 1997. Long short-term memory. Neural Computation, 9(8), 1735–1780.

Minard, A.L., Speranza, M., Caselli, T. 2016. The EVALITA 2016 Event Factuality Annotation Task (FactA). In Proceedings of Third Italian Conference on Computational Linguistics (CLiC-it 2016) & Fifth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2016).

Neunerdt, M., Trevisan, B., Reyer, M., Mathar, R. 2013. Part-of-speech tagging for social media texts. Language Processing and Knowledge in the Web. Springer, 139–150.

Owoputi, O., O’Connor, B., Dyer, C., Gimpel, K., Schneider, N., Smith, N.A. 2013. Improved Part-of-Speech Tagging for Online Conversational Text with Word Clusters. In Proceedings of NAACL 2013.