=Paper=
{{Paper
|id=Vol-1749/paper_019
|storemode=property
|title=When silver glitters more than gold: Bootstrapping an Italian part-of-speech tagger for Twitter
|pdfUrl=https://ceur-ws.org/Vol-1749/paper_019.pdf
|volume=Vol-1749
|authors=Barbara Plank,Malvina Nissim
|dblpUrl=https://dblp.org/rec/conf/clic-it/PlankN16
}}
==When silver glitters more than gold: Bootstrapping an Italian part-of-speech tagger for Twitter==
Barbara Plank (University of Groningen, The Netherlands) b.plank@rug.nl
Malvina Nissim (University of Groningen, The Netherlands) m.nissim@rug.nl
Abstract

English. We bootstrap a state-of-the-art part-of-speech tagger to tag Italian Twitter data, in the context of the Evalita 2016 PoSTWITA shared task. We show that training the tagger on native Twitter data enriched with small amounts of specifically selected gold data and additional silver-labelled data scraped from Facebook yields better results than using large amounts of manually annotated data from a mix of genres.

Italiano. In the context of the Evalita 2016 PoSTWITA evaluation campaign, we train two models that differ in their degree of supervision during training. The model trained with two cycles of bootstrapping on Facebook posts, which therefore also learns from "silver" labels, outperforms the supervised version that uses only manually annotated data. We discuss the importance of the choice of training and development data.

1 Introduction

The emergence and abundance of social media texts has prompted the urge to develop tools that can process language which is often non-conventional, in terms of both lexicon and grammar. Indeed, models trained on standard newswire data suffer heavily when used on data from a different language variety, especially Twitter (McClosky et al., 2010; Foster et al., 2011; Gimpel et al., 2011; Plank, 2016).

As a way to equip microblog processing with efficient tools, two ways of developing Twitter-compliant models have been explored. One option is to transform Twitter language back to what pre-trained models already know via normalisation operations, so that existing tools are more successful on such different data. The other option is to create native models by training them on labelled Twitter data. The drawback of the first option is that it is not clear what norm to target: "what is standard language?" (Eisenstein, 2013; Plank, 2016), and implementing normalisation procedures requires quite a lot of manual intervention and subjective decisions. The drawback of the second option is that manually annotated Twitter data is not readily available, and it is costly to produce.

In this paper, we report on our participation in PoSTWITA (http://corpora.ficlit.unibo.it/PoSTWITA/), the EVALITA 2016 shared task on Italian Part-of-Speech (POS) tagging for Twitter (Tamburini et al., 2016). We emphasise an approach geared to building a single model (rather than an ensemble) based on weakly supervised learning, thus favouring, over normalisation, the aforementioned second option of learning invariant representations, also for theoretical reasons. We address the bottleneck of acquiring manually annotated data by suggesting, and showing, that a semi-supervised approach that mainly focuses on tweaking data selection within a bootstrapping setting can be successfully pursued for this task. Contextually, we show that large amounts of manually annotated data might not be helpful if the data is not "of the right kind".

2 Data selection and bootstrapping

In adapting a POS tagger to Twitter, we mainly focus on ways of selectively enriching the training set with additional data. Rather than simply adding large amounts of existing annotated data, we investigate ways of selecting smaller amounts of more appropriate training instances, possibly even tagged with silver rather than gold labels.
As for the model itself, we simply take an off-the-shelf tagger, namely a bi-directional Long Short-Term Memory (bi-LSTM) model (Plank et al., 2016), which we use with default parameters (see Section 3.2), apart from initializing it with Twitter-trained embeddings (Section 3.1).

Our first model is trained on the PoSTWITA training set plus additional gold data selected according to two criteria (see below: Two shades of gold). This model is used to tag a collection of Facebook posts in a bootstrapping setting with two cycles (see below: Bootstrapping via Facebook). The rationale behind using Facebook as a not-so-distant source when targeting Twitter is the following: many Facebook posts of public, non-personal pages resemble tweets in style, because of their brevity and the use of hashtags. However, differently from random tweets, they are usually correctly formed grammatically and spelling-wise, and often provide more context, which allows for more accurate tagging.
Two shades of gold We used the Italian portion of the latest release (v1.3) of the Universal Dependency (UD) dataset (Nivre et al., 2016), from which we extracted two subsets, according to two different criteria. First, we selected data on the basis of its origin, trying to match the Twitter training data as closely as possible. For this reason, we used the Facebook subportion (UD_FB). These are 45 sentences that presumably stem from the Italian Facebook help pages and contain questions and short answers (they are labelled as 4-FB in the comment section of UD; examples include "Prima di effettuare la registrazione." and "È vero che Facebook sarà a pagamento?"). Second, by looking at the confusion matrix of one of the initial models, we saw that the model's performance was especially poor on cliticised verbs and interjections, tags that are also infrequent in the training set (Table 2). Therefore, from the Italian UD portion we selected any data (in terms of origin/genre) which contained the VERB_CLIT or INTJ tag, with the aim of boosting the identification of these categories. We refer to this set of 933 sentences as UD_verb_clit+intj.
Bootstrapping via Facebook We augmented our training set with silver-labelled data. With our best model trained on the original task data plus UD_verb_clit+intj and UD_FB, we tagged a collection of Facebook posts, added those to the training pool, and retrained our tagger. We used two iterations of indelible self-training (Abney, 2007), i.e., adding automatically tagged data whose labels do not change once added. Using the Facebook API through the facebook-sdk Python library (https://pypi.python.org/pypi/facebook-sdk), we scraped an average of 100 posts for each of the following pages, selected on the basis of our intuition and of reasonable page popularity (a sketch of this step is given after the list):

• sport: corrieredellosport
• news: Ansa.it, ilsole24ore, lastampa.it
• politics: matteorenziufficiale
• entertainment: novella2000, alFemminile
• travel: viaggiart
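For illustration, a minimal sketch of this scraping step with facebook-sdk, assuming a valid Graph API access token; the paper only states that roughly 100 posts per page were fetched, so the field selection and paging choices here are our own:

```python
# Sketch of the Facebook scraping step; requires a valid Graph API token.
# Fields and limits are illustrative assumptions, not details from the paper.
import facebook  # pip install facebook-sdk

PAGES = ["corrieredellosport", "Ansa.it", "ilsole24ore", "lastampa.it",
         "matteorenziufficiale", "novella2000", "alFemminile", "viaggiart"]

def scrape_posts(access_token, pages=PAGES, per_page=100):
    graph = facebook.GraphAPI(access_token)
    posts = {}
    for page in pages:
        # 'posts' is the feed edge of a public page; keep only the text.
        feed = graph.get_connections(page, "posts",
                                     fields="message", limit=per_page)
        posts[page] = [p["message"] for p in feed["data"] if "message" in p]
    return posts
```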
We included a second cycle of bootstrapping, scraping a few more Facebook pages (soloGossip.it, paesionline, espressonline, LaGazzettaDelloSport, again with an average of 100 posts each) and tagging the posts with the model that had been re-trained on the original training set plus the first round of Facebook data with silver labels (we refer to the whole of the automatically-labelled Facebook data as FB_silver). FB_silver was added to the training pool to train the final model, as sketched below. Statistics on the obtained data are given in Table 1. (Due to time constraints we did not add further iterations, so we cannot judge whether we had already reached a performance plateau.)
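Schematically, the procedure is an instance of indelible self-training; in this sketch, train() and tag() are placeholders standing in for training and applying the bi-LSTM tagger, not part of any actual API:

```python
# Indelible self-training over the Facebook posts: automatically tagged
# data keeps its labels once added to the pool.
def bootstrap(gold_data, unlabelled_batches):
    pool = list(gold_data)            # PoSTWITA train + selected UD gold
    model = train(pool)               # placeholder for training bilty
    for batch in unlabelled_batches:  # here: two rounds of Facebook posts
        silver = [(sent, tag(model, sent)) for sent in batch]
        pool.extend(silver)           # indelible: labels are never revised
        model = train(pool)
    return model
```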
Table 1: Statistics on the additional datasets.

Data               | Type        | Sents | Tokens
UD_FB              | gold        | 45    | 580
UD_verb_clit+intj  | gold        | 933   | 26k
FB (all, iter 1)   | silver      | 2243  | 37k
FB (all, iter 2)   | silver      | 3071  | 47k
Total added data   | gold+silver | 4049  | 74k

3 Experiments and Results

In this section we describe how we developed the two models of the final submission, including all preprocessing decisions. We highlight the importance of choosing an adequate development set to identify promising directions.

3.1 Experimental Setup

PoSTWITA data In the context of PoSTWITA, training data was provided to all participants in the form of manually labelled tweets. The tags comply with the UD tagset, with a couple of modifications due to the specific genre (emoticons are labelled with a dedicated tag, for example) and to subjective choices in the treatment of some morphological traits typical of Italian. Specifically, clitics and articulated prepositions are treated as one single form (see below: UD fused forms). The training set contains 6438 tweets, for a total of ca. 115K tokens. The distribution of tags together with examples is given in Table 2. The test set comprises 301 tweets (ca. 4800 tokens).

Table 2: Tag distribution in the original trainset.

Tag       | Explanation        | #Tokens | Example
NOUN      | noun               | 16378   | cittadini
PUNCT     | punctuation        | 14513   | ?
VERB      | verb               | 12380   | apprezzo
PROPN     | proper noun        | 11092   | Ancona
DET       | determiner         | 8955    | il
ADP       | preposition        | 8145    | per
ADV       | adverb             | 6041    | sempre
PRON      | pronoun            | 5656    | quello
ADJ       | adjective          | 5494    | mondiale
HASHTAG   | hashtag            | 5395    | #manovra
ADP_A     | articulated prep   | 4465    | nella
CONJ      | coordinating conj  | 2876    | ma
MENTION   | mention            | 2592    | @InArteMorgan
AUX       | auxiliary verb     | 2273    | potrebbe
URL       | url                | 2141    | http://t.co/La3opKcp
SCONJ     | subordinating conj | 1521    | quando
INTJ      | interjection       | 1404    | fanculo
NUM       | number             | 1357    | 23%
X         | anything else      | 776     | s...
EMO       | emoticon           | 637     |
VERB_CLIT | verb+clitic        | 539     | vergognarsi
SYM       | symbol             | 334     | →
PART      | particle           | 3       | 's
UD fused forms In the UD scheme for Italian, articulated prepositions (ADP_A) and cliticised verbs (VERB_CLIT) are annotated as separate word forms, while in PoSTWITA the original word form (e.g., 'alla' or 'arricchirsi') is annotated as a whole. In order to get the PoSTWITA ADP_A and VERB_CLIT tags for these fused word forms from UD, we adjusted the UCPH ud-conversion-tools (https://github.com/coastalcph/ud-conversion-tools) (Agić et al., 2016), which propagate head POS information up to the original form.
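As an illustration of the intended mapping, here is our own simplified reading of the conversion, not the actual ud-conversion-tools code:

```python
# Simplified illustration of collapsing UD multiword tokens back into
# PoSTWITA-style fused forms. A UD multiword token like "alla" spans the
# syntactic words "a" (ADP) and "la" (DET); PoSTWITA instead tags the
# fused form "alla" directly as ADP_A.
def fuse(form, word_tags):
    """word_tags: POS tags of the syntactic words inside the fused form."""
    if word_tags[0] == "ADP" and "DET" in word_tags[1:]:
        return form, "ADP_A"          # articulated preposition
    if word_tags[0] == "VERB" and "PRON" in word_tags[1:]:
        return form, "VERB_CLIT"      # cliticised verb
    return form, word_tags[0]         # fall back to the head's tag

print(fuse("alla", ["ADP", "DET"]))           # ('alla', 'ADP_A')
print(fuse("arricchirsi", ["VERB", "PRON"]))  # ('arricchirsi', 'VERB_CLIT')
```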
Pre-processing of unlabelled data For the Facebook data, we use a simplistic off-the-shelf rule-based tokeniser (https://github.com/bplank/multilingualtokenizer) that segments sentences by punctuation and tokens by whitespace. We normalise URLs to a single token (http://www.someurl.org) and add a rule for smileys. Finally, we remove sentences from the Facebook data where more than 90% of the tokens are in all caps. Unlabelled data used for embeddings is preprocessed only with normalisation of usernames and URLs.
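A compact sketch of these normalisation and filtering rules; the regexes, the @USER placeholder, and the threshold handling are our own illustrative choices:

```python
import re

URL_RE = re.compile(r"https?://\S+")
USER_RE = re.compile(r"@\w+")

def normalise(text):
    # Collapse every URL to the single placeholder token used in the paper;
    # for the embedding data we additionally normalise usernames.
    text = URL_RE.sub("http://www.someurl.org", text)
    return USER_RE.sub("@USER", text)  # placeholder token is an assumption

def keep_sentence(tokens, max_caps_ratio=0.9):
    # Drop Facebook sentences where more than 90% of tokens are all caps.
    caps = sum(1 for t in tokens if t.isupper())
    return caps / max(len(tokens), 1) <= max_caps_ratio
```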
Word Embeddings We induced word embeddings from 5 million Italian tweets (TWITA) from Twita (Basile and Nissim, 2013). Vectors were created using word2vec (Mikolov and Dean, 2013) with default parameters, except that we set the dimension to 64 to match the vector size of the multilingual (POLY) embeddings (Al-Rfou et al., 2013) used by Plank et al. (2016). We dealt with unknown words by adding a "UNK" token whose vector is the mean of the vectors of three infrequent words ("vip!", "cuora", "White").
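A sketch of this embedding induction using gensim's word2vec implementation; the paper does not name the exact word2vec tool, so the gensim calls and input handling here are assumptions:

```python
import numpy as np
from gensim.models import Word2Vec

def train_embeddings(tweets, dim=64):
    # tweets: an iterable of tokenised TWITA tweets (lists of tokens).
    # Dimension 64 matches the POLY vectors; the parameter is named
    # `vector_size` in gensim >= 4, `size` in older releases.
    model = Word2Vec(tweets, size=dim)
    # Represent unknown words by the mean vector of three infrequent words.
    rare = ["vip!", "cuora", "White"]
    unk = np.mean([model.wv[w] for w in rare], axis=0)
    return model, unk
```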
Figure 1: Word cloud from the training data.

Creation of a realistic internal development set The original task data is distributed as a single training file. In initial experiments we saw that performance varied considerably across different random subsets. This was due to a large bias towards tweets about 'Monti' and 'Grillo' (see Figure 1), but also to duplicate tweets. We opted to create the most difficult development set possible. This development set was obtained by removing duplicates and randomly selecting a subset of tweets that mention neither 'Grillo' nor 'Monti', while maximizing the out-of-vocabulary (OOV) rate with respect to the training data. Hence, our internal development set consisted of 700 tweets with an OOV rate approaching 50%. This represents a more realistic testing scenario. Indeed, the baseline (the basic bi-LSTM model) dropped to 92.41, from 94.37 computed on the earlier development set where we had randomly selected 1/5 of the data, with an OOV rate of 45% (see Table 4).
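A sketch of this selection heuristic; the greedy "sample and keep the hardest subset" strategy is one plausible reading of "maximizing OOV rate", since the paper does not spell out the exact procedure:

```python
import random

def oov_rate(tweets, train_vocab):
    toks = [t for tw in tweets for t in tw]
    return sum(t not in train_vocab for t in toks) / max(len(toks), 1)

def build_dev_set(tweets, train_vocab, size=700, trials=1000):
    # Candidates: deduplicated tweets mentioning neither Grillo nor Monti.
    seen, pool = set(), []
    for tw in tweets:
        key = tuple(tw)
        if key in seen or any(w.lower() in ("grillo", "monti") for w in tw):
            continue
        seen.add(key)
        pool.append(tw)
    # Keep the random sample with the highest OOV rate w.r.t. the training
    # data; number of trials is an illustrative assumption.
    return max((random.sample(pool, size) for _ in range(trials)),
               key=lambda s: oov_rate(s, train_vocab))
```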
3.2 Model

The bidirectional Long Short-Term Memory model bilty (https://github.com/bplank/bilstm-aux) is illustrated in Figure 2. It is a context bi-LSTM taking as input word embeddings w. Character embeddings c are incorporated via a hierarchical bi-LSTM, using a sequence bi-LSTM at the lower level (Ballesteros et al., 2015; Plank et al., 2016). The character representation is concatenated with the (learned) word embeddings w to form the input to the context bi-LSTM at the upper layers. We took default parameters, i.e., character embeddings set to 100, word embeddings set to 64, 20 iterations of training using Stochastic Gradient Descent, a single bi-LSTM layer, and regularization using Gaussian noise with σ = 0.2 (cdim 100, trainer sgd, indim 64, iters 20, h_layers 1, sigma 0.2). The model has been shown to achieve state-of-the-art performance on a range of languages, where the incorporation of character information was particularly effective (Plank et al., 2016). With these features and settings we train two models on different training sets.

Figure 2: Hierarchical bi-LSTM model using word (w) and character (c) representations.
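For concreteness, an invocation along these lines would reproduce the settings above; the flag spellings follow the parameter names reported in the paper, the file names are hypothetical, and the exact CLI of the bilstm-aux repository may differ across versions:

```python
# Hedged sketch of a bilty training run with the paper's settings.
import subprocess

subprocess.run([
    "python", "src/bilty.py",
    "--train", "postwita_plus_gold.conll",  # hypothetical file names
    "--dev", "internal_dev.conll",
    "--embeds", "twita64.vec",               # TWITA-trained embeddings
    "--in_dim", "64", "--c_in_dim", "100",
    "--h_layers", "1", "--iters", "20",
    "--trainer", "sgd", "--sigma", "0.2",
])
```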
GOLDPICK bilty with pre-initialised TWITA embeddings, trained on the PoSTWITA training set plus selected gold data (UD_FB + UD_verb_clit+intj).

SILVERBOOT a bootstrapped version of GOLDPICK, where FB_silver (see Section 2) is also added to the training pool, which thus includes both gold and silver data.

3.3 Results on test data

Participants were allowed to submit one official and one additional (unofficial) run. Because on development data SILVERBOOT performed better than GOLDPICK, we selected the former for our official submission and the latter for the unofficial one, thus also making it possible to assess the specific contribution of bootstrapping to performance.
Table 3: Results on the official test set. BEST is the highest performing system at PoSTWITA.

System                    | Accuracy
BEST                      | 93.19
SILVERBOOT (official)     | 92.25
GOLDPICK (unofficial)     | 91.85
TnT (on PoSTWITA train)   | 84.83
TnT (on SILVERBOOT data)  | 85.52

Table 3 shows the results on the official test data for both our models and TnT (Brants, 2000). The results show that adding bootstrapped silver data outperforms the model trained on gold data alone. The additional training data included in SILVERBOOT reduced the OOV rate on the test set to 41.2% (compared to 46.9% with respect to the original PoSTWITA training set). Note that on the original, randomly selected development set the results were less indicative of the contribution of the silver data (see Table 4), which shows the importance of a carefully selected development set.

4 What didn't work

In addition to what we found to boost the tagger's performance, we also observed what didn't yield any improvements, and in some cases even lowered global accuracy. What we experimented with was triggered by intuition and previous work, as well as by what we had already found to be successful, such as selecting additional data to make up for under-represented tags in the training set. However, everything we report in this section turned out to be either pointless or detrimental.

More data We added to the training data all (train, development, and test) sections from the Italian part of UD 1.3. While training on selected gold data (978 sentences) yielded 95.06% accuracy, adding all of the UD data (12k sentences of newswire, legal and wiki texts) yielded a disappointing 94.88% in initial experiments (see Table 4), while also considerably slowing down training.

Next, we tried to add more Twitter data from xLIME, a publicly available corpus with multiple layers of manually assigned labels, including POS tags, for a total of ca. 8600 tweets and 160K tokens (Rei et al., 2016). The data isn't provided as a single gold-standard file, but in the form of separate annotations produced by different judges, so we used MACE (Hovy et al., 2013) to adjudicate divergences. Additionally, the tagset is slightly different from the UD set, so we had to implement a mapping. The results in Table 4 show that adding the xLIME data degrades performance, despite careful preprocessing to map the tags and resolve annotation divergences.
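By way of illustration, the adjudication-plus-mapping step could look as follows; the MACE invocation follows the tool's documented CSV interface, while the xLIME-to-UD tag pairs shown are invented placeholders, since we do not reproduce the actual heuristic mapping here:

```python
import csv
import subprocess

# Hypothetical fragment of the xLIME -> UD/PoSTWITA tag mapping; the real
# mapping is heuristic and more complete.
TAG_MAP = {"N": "NOUN", "V": "VERB", "ADJ": "ADJ"}

def adjudicate(annotations, out_prefix="xlime"):
    # MACE expects one item per row and one annotator per column, with
    # empty cells for missing annotations.
    with open(out_prefix + ".csv", "w", newline="") as f:
        csv.writer(f).writerows(annotations)
    subprocess.run(["java", "-jar", "MACE.jar",
                    "--prefix", out_prefix, out_prefix + ".csv"])
    with open(out_prefix + ".prediction") as f:
        # Falling back to X for unmapped tags is our own assumption.
        return [TAG_MAP.get(t.strip(), "X") for t in f]
```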
Table 4: Results on the internal development set.

System                             | Accuracy
Internal dev (prior), OOV: 45%     |
BASELINE (w/o emb)                 | 94.37
  +POLY emb                        | 94.15
  +TWITA emb                       | 94.69
BASELINE+TWITA emb:                |
  +Morph-it! coarse MTL            | 94.61
  +Morph-it! fine MTL              | 94.68
  +UD all                          | 94.88
  +gold-picked                     | 95.06
  +gold-picked+silver (1st round)  | 95.08
Internal dev (realistic), OOV: 50% |
BASELINE (incl. TWITA emb)         | 92.41
  +gold (GOLDPICK)                 | 93.19
  +gold+silver (SILVERBOOT)        | 93.42
Adding more gold (Twitter) data:   |
  +xLIME ADJUDICATED (48)          | 92.58
  +xLIME SINGLE ANNOT.             | 91.67
  +xLIME ALL (8k)                  | 92.04

More tag-specific data From the confusion matrix computed on the dev set, it emerged that the most confused categories were NOUN and PROPN. Following the same principle that led us to add UD_verb_clit+intj, we tried to reduce this confusion by providing additional training data containing proper nouns. This did not yield any improvements, neither in terms of global accuracy nor in terms of precision and recall of the two tags.
Multi-task learning Multi-task learning (MTL) (Caruana, 1997), namely a learning setting where more than one task is learnt at the same time, has been shown to improve performance for several NLP tasks (Collobert et al., 2011; Bordes et al., 2012; Liu et al., 2015). Often, what is learnt is one main task and, additionally, a number of auxiliary tasks, where the latter should help the model converge better and overfit less on the former. In this context, the additional signal we used to support the learning of each token's POS tag is the token's degree of ambiguity. Using the information stored in Morph-it!, a lexicon of Italian inflected forms with their lemma and morphological features (Zanchetta and Baroni, 2005), we obtained the number of different tags potentially associated with each token. Because the Morph-it! labels are highly fine-grained, we derived two different ambiguity scores, one on the original and one on coarser tags. In neither case did the additional signal contribute to the tagger's performance, but we have not explored this direction fully and leave it for future investigations.
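A sketch of how such an ambiguity signal can be derived from the lexicon; the Morph-it! file layout is simplified here to tab-separated form/lemma/tag triples, and the coarsening rule (taking the tag prefix before the first colon) is our assumption:

```python
from collections import defaultdict

def ambiguity_scores(morphit_path):
    # Morph-it! lists inflected forms with lemma and a fine-grained tag,
    # e.g. "porta<TAB>portare<TAB>VER:ind+pres+3+s" (simplified layout).
    fine, coarse = defaultdict(set), defaultdict(set)
    with open(morphit_path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip("\n").split("\t")
            if len(parts) != 3:
                continue
            form, _lemma, tag = parts
            fine[form].add(tag)
            coarse[form].add(tag.split(":")[0])  # assumed coarsening rule
    # The auxiliary MTL label of a token is its number of possible tags.
    return ({w: len(t) for w, t in fine.items()},
            {w: len(t) for w, t in coarse.items()})
```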
5 Conclusions

The main conclusion we draw from the experiments in this paper is that data selection matters, not only for training but also for taking informed decisions during development. Indeed, only after creating a carefully designed internal development set did we obtain stronger evidence of the contribution of silver data, which is also reflected in the official results. We also observe that choosing less but more targeted data is more effective. For instance, TWITA embeddings contribute more than generic POLY embeddings, which were trained on substantially larger amounts of Wikipedia data. Also, just blindly adding training data does not help. We have seen that using the whole of the UD corpus is not beneficial to performance when compared to a small amount of selected gold data, in terms of both origin and labels covered. Finally, and most importantly, we have found that adding small amounts of not-so-distant silver data obtained via bootstrapping resulted in our best model.

We believe the low performance observed when adding xLIME data is likely due to the non-correspondence of tags in the two datasets, which required a heuristic-based mapping. While this is only a speculation that requires further investigation, it seems to indicate that exploring semi-supervised strategies is preferable to producing idiosyncratic or project-specific gold annotations.

Acknowledgments We thank the CIT of the University of Groningen for providing access to the Peregrine HPC cluster. Barbara Plank acknowledges NVIDIA corporation for support.
References

Steven Abney. 2007. Semisupervised Learning for Computational Linguistics. CRC Press.

Željko Agić, Anders Johannsen, Barbara Plank, Héctor Martínez Alonso, Natalie Schluter, and Anders Søgaard. 2016. Multilingual projection for parsing truly low-resource languages. Transactions of the Association for Computational Linguistics (TACL), 4:301–312.

Rami Al-Rfou, Bryan Perozzi, and Steven Skiena. 2013. Polyglot: Distributed word representations for multilingual NLP. arXiv preprint arXiv:1307.1662.

Miguel Ballesteros, Chris Dyer, and Noah A. Smith. 2015. Improved transition-based parsing by modeling characters instead of words with LSTMs. In EMNLP.

Valerio Basile and Malvina Nissim. 2013. Sentiment analysis on Italian tweets. In Proceedings of the 4th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, pages 100–107.

Antoine Bordes, Xavier Glorot, Jason Weston, and Yoshua Bengio. 2012. Joint learning of words and meaning representations for open-text semantic parsing. In AISTATS, volume 351, pages 423–424.

Thorsten Brants. 2000. TnT: a statistical part-of-speech tagger. In ANLP.

Rich Caruana. 1997. Multitask learning. Machine Learning, 28:41–75.

Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12(Aug):2493–2537.

Jacob Eisenstein. 2013. What to do about bad language on the internet. In Proceedings of NAACL, pages 359–369, Atlanta.

Jennifer Foster, Ozlem Cetinoglu, Joachim Wagner, Josef Le Roux, Joakim Nivre, Deirdre Hogan, and Josef van Genabith. 2011. From news to comments: Resources and benchmarks for parsing the language of Web 2.0. In IJCNLP.

Kevin Gimpel, Nathan Schneider, Brendan O'Connor, Dipanjan Das, Daniel Mills, Jacob Eisenstein, Michael Heilman, Dani Yogatama, Jeffrey Flanigan, and Noah A. Smith. 2011. Part-of-speech tagging for Twitter: Annotation, features, and experiments. In Proceedings of ACL.

Dirk Hovy, Taylor Berg-Kirkpatrick, Ashish Vaswani, and Eduard Hovy. 2013. Learning whom to trust with MACE. In NAACL.

Xiaodong Liu, Jianfeng Gao, Xiaodong He, Li Deng, Kevin Duh, and Ye-Yi Wang. 2015. Representation learning using multi-task deep neural networks for semantic classification and information retrieval. In Proceedings of NAACL.

David McClosky, Eugene Charniak, and Mark Johnson. 2010. Automatic domain adaptation for parsing. In NAACL-HLT.

T. Mikolov and J. Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems.

Joakim Nivre et al. 2016. Universal Dependencies 1.3. LINDAT/CLARIN digital library at the Institute of Formal and Applied Linguistics, Charles University in Prague.

Barbara Plank, Anders Søgaard, and Yoav Goldberg. 2016. Multilingual part-of-speech tagging with bidirectional long short-term memory models and auxiliary loss. In ACL.

Barbara Plank. 2016. What to do about non-standard (or non-canonical) language in NLP. In KONVENS.

Luis Rei, Dunja Mladenic, and Simon Krek. 2016. A multilingual social media linguistic corpus. In Conference of CMC and Social Media Corpora for the Humanities.

Fabio Tamburini, Cristina Bosco, Alessandro Mazzei, and Andrea Bolioli. 2016. Overview of the EVALITA 2016 Part Of Speech on TWitter for ITAlian Task. In Pierpaolo Basile, Anna Corazza, Franco Cutugno, Simonetta Montemagni, Malvina Nissim, Viviana Patti, Giovanni Semeraro, and Rachele Sprugnoli, editors, Proceedings of the Third Italian Conference on Computational Linguistics (CLiC-it 2016) & Fifth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian: Final Workshop (EVALITA 2016). Associazione Italiana di Linguistica Computazionale (AILC).

Eros Zanchetta and Marco Baroni. 2005. Morph-it! A free corpus-based morphological resource for the Italian language. Corpus Linguistics 2005, 1(1).