When silver glitters more than gold: Bootstrapping an Italian part-of-speech tagger for Twitter

Barbara Plank, Malvina Nissim
University of Groningen, The Netherlands
b.plank@rug.nl, m.nissim@rug.nl

Abstract

English. We bootstrap a state-of-the-art part-of-speech tagger to tag Italian Twitter data, in the context of the Evalita 2016 PoSTWITA shared task. We show that training the tagger on native Twitter data enriched with small amounts of specifically selected gold data and additional silver-labelled data scraped from Facebook yields better results than using large amounts of manually annotated data from a mix of genres.

Italiano (translated). In the context of the Evalita 2016 PoSTWITA evaluation campaign, we train two models that differ in the degree of supervision used at training time. The model trained with two bootstrapping cycles over Facebook posts, which therefore also learns from "silver" labels, outperforms the supervised version that uses only manually annotated data. We discuss the importance of the choice of training and development data.

1 Introduction

The emergence and abundance of social media texts has prompted the urge to develop tools that are able to process language which is often non-conventional, in terms of both lexicon and grammar. Indeed, models trained on standard newswire data suffer heavily when used on data from a different language variety, especially Twitter (McClosky et al., 2010; Foster et al., 2011; Gimpel et al., 2011; Plank, 2016).

As a way to equip microblog processing with efficient tools, two ways of developing Twitter-compliant models have been explored. One option is to transform Twitter language back to what pre-trained models already know via normalisation operations, so that existing tools are more successful on such different data. The other option is to create native models by training them on labelled Twitter data. The drawback of the first option is that it is not clear what norm to target: "what is standard language?" (Eisenstein, 2013; Plank, 2016), and implementing normalisation procedures requires quite a lot of manual intervention and subjective decisions. The drawback of the second option is that manually annotated Twitter data is not readily available, and it is costly to produce.

In this paper, we report on our participation in PoSTWITA[1], the EVALITA 2016 shared task on Italian Part-of-Speech (POS) tagging for Twitter (Tamburini et al., 2016). We emphasise an approach geared to building a single model (rather than an ensemble) based on weakly supervised learning, thus favouring, over normalisation, the aforementioned second option of learning invariant representations, also for theoretical reasons. We address the bottleneck of acquiring manually annotated data by suggesting, and showing, that a semi-supervised approach that mainly focuses on tweaking data selection within a bootstrapping setting can be successfully pursued for this task. Contextually, we show that large amounts of manually annotated data might not be helpful if the data is not "of the right kind".

[1] http://corpora.ficlit.unibo.it/PoSTWITA/

2 Data selection and bootstrapping

In adapting a POS tagger to Twitter, we mainly focus on ways of selectively enriching the training set with additional data. Rather than simply adding large amounts of existing annotated data, we investigate ways of selecting smaller amounts of more appropriate training instances, possibly even tagged with silver rather than gold labels. As for the model itself, we simply take an off-the-shelf tagger, namely a bi-directional Long Short-Term Memory (bi-LSTM) model (Plank et al., 2016), which we use with default parameters (see Section 3.2) apart from initialising it with Twitter-trained embeddings (Section 3.1).

Our first model is trained on the PoSTWITA training set plus additional gold data selected according to two criteria (see below: Two shades of gold). This model is used to tag a collection of Facebook posts in a bootstrapping setting with two cycles (see below: Bootstrapping via Facebook). The rationale behind using Facebook as a not-so-distant source when targeting Twitter is the following: many Facebook posts of public, non-personal pages resemble tweets in style, because of their brevity and use of hashtags. However, differently from random tweets, they are usually correctly formed grammatically and spelling-wise, and often provide more context, which allows for more accurate tagging.

Two shades of gold. We used the Italian portion of the latest release (v1.3) of the Universal Dependencies (UD) dataset (Nivre et al., 2016), from which we extracted two subsets, according to two different criteria. First, we selected data on the basis of its origin, trying to match the Twitter training data as closely as possible. For this reason, we used the Facebook subportion (UD_FB). These are 45 sentences that presumably stem from the Italian Facebook help pages and contain questions and short answers.[2] Second, by looking at the confusion matrix of one of the initial models, we saw that the model's performance was especially poor on cliticised verbs and interjections, tags that are also infrequent in the training set (Table 2). Therefore, from the Italian UD portion we selected any data (in terms of origin/genre) which contained the VERB_CLIT or INTJ tag, with the aim of boosting the identification of these categories. We refer to this set of 933 sentences as UD_verb_clit+intj.

[2] These are labelled as 4-FB in the comment section of UD. Examples include: "Prima di effettuare la registrazione."; "È vero che Facebook sarà a pagamento?"

Bootstrapping via Facebook. We augmented our training set with silver-labelled data. With our best model trained on the original task data plus UD_verb_clit+intj and UD_FB, we tagged a collection of Facebook posts, added those to the training pool, and retrained our tagger. We used two iterations of indelible self-training (Abney, 2007), i.e., adding automatically tagged data whose labels do not change once added. Using the Facebook API through the facebook-sdk Python library[3], we scraped an average of 100 posts for each of the following pages, selected on the basis of our intuition and of reasonable page popularity:

• sport: corrieredellosport
• news: Ansa.it, ilsole24ore, lastampa.it
• politics: matteorenziufficiale
• entertainment: novella2000, alFemminile
• travel: viaggiart

[3] https://pypi.python.org/pypi/facebook-sdk

We included a second cycle of bootstrapping, scraping a few more Facebook pages (soloGossip.it, paesionline, espressonline, LaGazzettaDelloSport, again with an average of 100 posts each) and tagging the posts with the model that had been re-trained on the original training set plus the first round of Facebook data with silver labels (we refer to the whole of the automatically labelled Facebook data as FB_silver). FB_silver was added to the training pool to train the final model; the overall loop is sketched below. Statistics on the obtained data are given in Table 1.[4]

[4] Due to time constraints we did not add further iterations; we cannot judge whether we had already reached a performance plateau.

Table 1: Statistics on the additional datasets.

  Data                 Type          Sents   Tokens
  UD_FB                gold             45      580
  UD_verb_clit+intj    gold            933      26k
  FB (all, iter 1)     silver         2243      37k
  FB (all, iter 2)     silver         3071      47k
  Total added data     gold+silver    4049      74k
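For concreteness, the two-cycle loop can be summarised in a short sketch. This is a minimal illustration, not the actual experiment scripts: train_tagger, tag_posts and scrape_page are hypothetical callables standing in for bilty training/prediction and for facebook-sdk page scraping.

```python
def bootstrap(gold_data, page_batches, train_tagger, tag_posts, scrape_page,
              posts_per_page=100):
    """Indelible self-training (Abney, 2007): automatically tagged
    (silver) posts are added to the training pool and their labels
    are never revised afterwards."""
    train_pool = list(gold_data)              # PoSTWITA + selected UD gold
    model = train_tagger(train_pool)
    for pages in page_batches:                # one batch of pages per cycle
        posts = [p for page in pages
                 for p in scrape_page(page, limit=posts_per_page)]
        silver = tag_posts(model, posts)      # silver-labelled data
        train_pool.extend(silver)             # indelible: labels are frozen
        model = train_tagger(train_pool)      # retrain on gold + silver
    return model
```

With the page lists above, the first cycle adds silver data from eight pages and the second from four more, after which the final model is trained on the full gold+silver pool.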
3 Experiments and Results

In this section we describe how we developed the two models of the final submission, including all preprocessing decisions. We highlight the importance of choosing an adequate development set to identify promising directions.

3.1 Experimental Setup

PoSTWITA data. In the context of PoSTWITA, training data was provided to all participants in the form of manually labelled tweets. The tags comply with the UD tagset, with a couple of modifications due to the specific genre (emoticons are labelled with a dedicated tag, for example), and subjective choices in the treatment of some morphological traits typical of Italian. Specifically, clitics and articulated prepositions are treated as one single form (see below: UD fused forms). The training set contains 6438 tweets, for a total of ca. 115K tokens. The distribution of tags together with examples is given in Table 2. The test set comprises 301 tweets (ca. 4800 tokens).

Table 2: Tag distribution in the original trainset.

  Tag        Explanation          #Tokens   Example
  NOUN       noun                   16378   cittadini
  PUNCT      punctuation            14513   ?
  VERB       verb                   12380   apprezzo
  PROPN      proper noun            11092   Ancona
  DET        determiner              8955   il
  ADP        preposition             8145   per
  ADV        adverb                  6041   sempre
  PRON       pronoun                 5656   quello
  ADJ        adjective               5494   mondiale
  HASHTAG    hashtag                 5395   #manovra
  ADP_A      articulated prep        4465   nella
  CONJ       coordinating conj       2876   ma
  MENTION    mention                 2592   @InArteMorgan
  AUX        auxiliary verb          2273   potrebbe
  URL        url                     2141   http://t.co/La3opKcp
  SCONJ      subordinating conj      1521   quando
  INTJ       interjection            1404   fanculo
  NUM        number                  1357   23%
  X          anything else            776   s...
  EMO        emoticon                 637
  VERB_CLIT  verb+clitic              539   vergognarsi
  SYM        symbol                   334   →
  PART       particle                   3   's

UD fused forms. In the UD scheme for Italian, articulated prepositions (ADP_A) and cliticised verbs (VERB_CLIT) are annotated as separate word forms, while in PoSTWITA the original word form (e.g., 'alla' or 'arricchirsi') is annotated as a whole. In order to obtain the PoSTWITA ADP_A and VERB_CLIT tags for these fused word forms from UD, we adapted the UCPH ud-conversion-tools[5] (Agić et al., 2016), which propagate head POS information up to the original form.

[5] https://github.com/coastalcph/ud-conversion-tools
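The conversion essentially collapses UD multiword token ranges back into single tagged forms. The following is a simplified sketch of that step under our reading of the head-propagation idea, not the adapted ud-conversion-tools code; the tag-mapping heuristic is our own illustration.

```python
import re

def fused_tag(sub_tags):
    """Map the UPOS tags of a multiword token's parts to the fused
    PoSTWITA tag (simplified stand-in for head-POS propagation)."""
    if sub_tags[0] == "ADP" and "DET" in sub_tags[1:]:
        return "ADP_A"          # e.g. 'alla' = a (ADP) + la (DET)
    if "VERB" in sub_tags and "PRON" in sub_tags:
        return "VERB_CLIT"      # e.g. 'arricchirsi' = arricchire + si
    return sub_tags[0]          # fall back to the first part's tag

def merge_multiword_tokens(conllu_lines):
    """Yield (form, tag) pairs with UD multiword ranges collapsed."""
    rows = [l.split("\t") for l in conllu_lines if l.strip() and l[0] != "#"]
    i = 0
    while i < len(rows):
        m = re.match(r"(\d+)-(\d+)$", rows[i][0])   # range line, e.g. "2-3"
        if m:
            span = int(m.group(2)) - int(m.group(1)) + 1
            sub_tags = [r[3] for r in rows[i + 1:i + 1 + span]]  # UPOS col
            yield rows[i][1], fused_tag(sub_tags)   # fused surface form
            i += 1 + span
        else:
            yield rows[i][1], rows[i][3]
            i += 1
```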
Pre-processing of unlabelled data. For the Facebook data, we use a simplistic off-the-shelf rule-based tokeniser that segments sentences by punctuation and tokens by whitespace.[6] We normalise URLs to a single token (http://www.someurl.org) and add a rule for smileys. Finally, we remove sentences from the Facebook data where more than 90% of the tokens are in all caps. Unlabelled data used for embeddings is preprocessed only with normalisation of usernames and URLs.

[6] https://github.com/bplank/multilingualtokenizer

Word Embeddings. We induced word embeddings from 5 million Italian tweets from TWITA (Basile and Nissim, 2013). Vectors were created using word2vec (Mikolov and Dean, 2013) with default parameters, except that we set the dimensions to 64, to match the vector size of the multilingual (POLY) embeddings (Al-Rfou et al., 2013) used by Plank et al. (2016). We dealt with unknown words by adding a "UNK" token, computed as the mean vector of three infrequent words ("vip!", "cuora", "White").
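A compact sketch of these two steps follows, using gensim as a stand-in for the original word2vec tool (the paper does not state which wrapper was used); the all-caps threshold and the three infrequent words are taken from the text above.

```python
import re
import numpy as np
from gensim.models import Word2Vec

URL_RE = re.compile(r"https?://\S+")

def normalise(tokens):
    """Collapse every URL into the single placeholder token."""
    return [URL_RE.sub("http://www.someurl.org", t) for t in tokens]

def mostly_caps(tokens, threshold=0.9):
    """Facebook filter rule: flag sentences in which more than 90% of
    the alphabetic tokens are in all caps."""
    alpha = [t for t in tokens if t.isalpha()]
    return bool(alpha) and sum(t.isupper() for t in alpha) / len(alpha) > threshold

def train_twita_embeddings(tweets):
    """tweets: pre-tokenised tweets with usernames/URLs normalised.
    Dimension 64 matches the POLY vectors; gensim >= 4 names the
    parameter vector_size (older versions call it size)."""
    model = Word2Vec([normalise(t) for t in tweets], vector_size=64)
    # "UNK" vector: mean of the three infrequent words named in the paper.
    rare = [w for w in ("vip!", "cuora", "White") if w in model.wv]
    return model, np.mean([model.wv[w] for w in rare], axis=0)
```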
Creation of a realistic internal development set. The original task data is distributed as a single training file. In initial experiments we saw that performance varied considerably across different random subsets. This was due to a large bias towards tweets about 'Monti' and 'Grillo' (see Figure 1), but also to duplicate tweets. We opted to create the most difficult development set possible. This development set was obtained by removing duplicates and randomly selecting a subset of tweets that do not mention 'Grillo' or 'Monti', while maximizing the out-of-vocabulary (OOV) rate with respect to the training data; one plausible implementation is sketched below. Our internal development set hence consisted of 700 tweets with an OOV rate approaching 50%. This represents a more realistic testing scenario. Indeed, the baseline (the basic bi-LSTM model) dropped to 92.41 from the 94.37 computed on the earlier development set, where we had randomly selected 1/5 of the data, with an OOV rate of 45% (see Table 4).

[Figure 1: Word cloud from the training data.]
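Since the paper does not spell out the exact sampling procedure, the following is one plausible greedy reading of "remove duplicates, exclude the two entities, maximise OOV"; all names are ours, and sampling randomly among high-OOV candidates would be an equally plausible variant.

```python
def build_dev_set(tweets, train_vocab, n=700, banned=("grillo", "monti")):
    """Build a deliberately hard development set: deduplicate, drop
    tweets mentioning the two over-represented entities, and keep the
    n tweets with the highest OOV rate w.r.t. the training vocabulary.
    A plausible reconstruction, not the authors' actual script."""
    seen, scored = set(), []
    for toks in tweets:                      # toks: one tokenised tweet
        if not toks:
            continue
        key = " ".join(toks).lower()
        if key in seen or any(b in key for b in banned):
            continue                         # duplicate or banned entity
        seen.add(key)
        oov = sum(t.lower() not in train_vocab for t in toks) / len(toks)
        scored.append((oov, toks))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [toks for _, toks in scored[:n]]  # hardest n tweets
```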
3.2 Model

The bidirectional Long Short-Term Memory model bilty[7] is illustrated in Figure 2. It is a context bi-LSTM taking as input word embeddings w. Character embeddings c are incorporated via a hierarchical bi-LSTM, using a sequence bi-LSTM at the lower level (Ballesteros et al., 2015; Plank et al., 2016). The character representation is concatenated with the (learned) word embedding w to form the input to the context bi-LSTM at the upper layers. We took default parameters, i.e., character embeddings of size 100, word embeddings of size 64, 20 iterations of training using Stochastic Gradient Descent, a single bi-LSTM layer, and regularisation using Gaussian noise with σ = 0.2 (cdim 100, trainer sgd, indim 64, iters 20, h_layer 1, sigma 0.2). The model has been shown to achieve state-of-the-art performance on a range of languages, where the incorporation of character information was particularly effective (Plank et al., 2016). With these features and settings we train two models on different training sets; the two models are described after the architecture sketch below.

[7] https://github.com/bplank/bilstm-aux

[Figure 2: Hierarchical bi-LSTM model using word w and character c representations.]
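To make the architecture concrete, here is a minimal re-implementation sketch of the hierarchical word+character encoder in PyTorch. It is an illustrative stand-in, not the original bilty code: dimensions follow the defaults above, while the Gaussian-noise regularisation and the SGD training loop are omitted.

```python
import torch
import torch.nn as nn

class HierarchicalTagger(nn.Module):
    """Sketch of the bilty idea: a character bi-LSTM whose final states
    are concatenated with the word embedding, feeding a context bi-LSTM
    over the sentence, followed by a per-token tag scorer."""
    def __init__(self, n_words, n_chars, n_tags, wdim=64, cdim=100):
        super().__init__()
        self.wemb = nn.Embedding(n_words, wdim)
        self.cemb = nn.Embedding(n_chars, cdim)
        self.char_lstm = nn.LSTM(cdim, cdim, bidirectional=True)
        self.ctx_lstm = nn.LSTM(wdim + 2 * cdim, 100, bidirectional=True)
        self.out = nn.Linear(200, n_tags)

    def forward(self, word_ids, char_ids_per_word):
        reps = []
        for wid, cids in zip(word_ids, char_ids_per_word):
            chars = self.cemb(cids).unsqueeze(1)      # (len, 1, cdim)
            _, (h, _) = self.char_lstm(chars)         # h: (2, 1, cdim)
            cvec = torch.cat([h[0, 0], h[1, 0]])      # fwd+bwd final states
            reps.append(torch.cat([self.wemb(wid), cvec]))
        seq = torch.stack(reps).unsqueeze(1)          # (n, 1, wdim + 2*cdim)
        ctx, _ = self.ctx_lstm(seq)                   # (n, 1, 200)
        return self.out(ctx.squeeze(1))               # (n, n_tags) tag scores

# toy usage: a three-word sentence with its character ids
# tagger = HierarchicalTagger(n_words=100, n_chars=50, n_tags=22)
# scores = tagger(torch.tensor([1, 2, 3]),
#                 [torch.tensor([4, 5]), torch.tensor([6]),
#                  torch.tensor([7, 8, 9])])
```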
GOLDPICK: bilty with pre-initialised TWITA embeddings, trained on the PoSTWITA training set plus the selected gold data (UD_FB + UD_verb_clit+intj).

SILVERBOOT: a bootstrapped version of GOLDPICK, where FB_silver (see Section 2) is also added to the training pool, which thus includes both gold and silver data.

3.3 Results on test data

Participants were allowed to submit one official and one additional (unofficial) run. Because SILVERBOOT performed better than GOLDPICK on development data, we selected the former for our official submission and the latter for the unofficial one, making it thus also possible to assess the specific contribution of bootstrapping to performance.

Table 3 shows the results on the official test data for both our models and TnT (Brants, 2000). The results show that adding bootstrapped silver data outperforms the model trained on gold data alone. The additional training data included in SILVERBOOT reduced the OOV rate on the test set to 41.2% (compared to 46.9% with respect to the original PoSTWITA training set). Note that on the original, randomly selected development set the results were less indicative of the contribution of the silver data (see Table 4), showing the importance of a carefully selected development set.

Table 3: Results on the official test set. BEST is the highest performing system at PoSTWITA.

  System                       Accuracy
  BEST                            93.19
  SILVERBOOT (official)           92.25
  GOLDPICK (unofficial)           91.85
  TnT (on PoSTWITA train)         84.83
  TnT (on SILVERBOOT data)        85.52

Table 4: Results on internal development set.

  System                                Accuracy
  Internal dev (prior), OOV: 45%
  BASELINE (w/o emb)                       94.37
  +POLY emb                                94.15
  +TWITA emb                               94.69
  BASELINE+TWITA emb
  +Morphit! coarse MTL                     94.61
  +Morphit! fine MTL                       94.68
  +UD all                                  94.88
  +gold-picked                             95.06
  +gold-picked+silver (1st round)          95.08
  Internal dev (realistic), OOV: 50%
  BASELINE (incl. TWITA emb)               92.41
  +gold (GOLDPICK)                         93.19
  +gold+silver (SILVERBOOT)                93.42
  adding more gold (Twitter) data:
  +xLIME ADJUDICATED (48)                  92.58
  +xLIME SINGLE ANNOT.                     91.67
  +xLIME ALL (8k)                          92.04

4 What didn't work

In addition to what we found to boost the tagger's performance, we also observed what did not yield any improvements, and in some cases even lowered global accuracy. What we experimented with was triggered by intuition and previous work, as well as by what we had already found to be successful, such as selecting additional data to make up for under-represented tags in the training set. However, everything we report in this section turned out to be either pointless or detrimental.

More data. We added to the training data all (train, development, and test) sections from the Italian part of UD 1.3. While training on the selected gold data (978 sentences) yielded 95.06% accuracy, adding all of the UD data (12k sentences of newswire, legal and wiki texts) yielded a disappointing 94.88% in initial experiments (see Table 4), while also considerably slowing down training.

Next, we tried to add more Twitter data from xLIME, a publicly available corpus with multiple layers of manually assigned labels, including POS tags, for a total of ca. 8600 tweets and 160K tokens (Rei et al., 2016). The data is not provided as a single gold-standard file but in the form of separate annotations produced by different judges, so we used MACE (Hovy et al., 2013) to adjudicate divergences. Additionally, the tagset is slightly different from the UD set, so we had to implement a mapping. The results in Table 4 show that adding all of the xLIME data degrades performance, despite careful preprocessing to map the tags and resolve annotation divergences.

More tag-specific data. From the confusion matrix computed on the dev set, it emerged that the most confused categories were NOUN and PROPN. Following the same principle that led us to add UD_verb_clit+intj, we tried to reduce such confusion by providing additional training data containing proper nouns. This did not yield any improvements, neither in terms of global accuracy nor in terms of precision and recall for the two tags.

Multi-task learning. Multi-task learning (MTL) (Caruana, 1997), namely a learning setting where more than one task is learnt at the same time, has been shown to improve performance on several NLP tasks (Collobert et al., 2011; Bordes et al., 2012; Liu et al., 2015). Often, what is learnt is one main task and, additionally, a number of auxiliary tasks, where the latter should help the model converge better and overfit less on the former. In this context, the additional signal we used to support the learning of each token's POS tag is the token's degree of ambiguity. Using the information stored in Morph-it!, a lexicon of Italian inflected forms with their lemma and morphological features (Zanchetta and Baroni, 2005), we obtained the number of different tags potentially associated with each token. Because the Morph-it! labels are highly fine-grained, we derived two different ambiguity scores, one on the original and one on coarser tags. In neither case did the additional signal contribute to the tagger's performance, but we have not explored this direction fully and leave it for future investigation.
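As an illustration of how such an auxiliary signal can be derived, here is a sketch that assumes Morph-it!'s tab-separated form/lemma/tag entry layout; the ":"-based coarsening is our assumption about a reasonable cut, not the paper's exact mapping.

```python
from collections import defaultdict

def ambiguity_scores(morphit_entries, coarse=lambda t: t.split(":")[0]):
    """Derive per-token tag-ambiguity counts from Morph-it! entries,
    i.e. (form, lemma, fine-grained tag) triples. Returns a fine score
    (distinct fine-grained tags per form) and a coarse score (distinct
    coarsened tags per form)."""
    fine, crs = defaultdict(set), defaultdict(set)
    for form, _lemma, tag in morphit_entries:
        fine[form].add(tag)          # e.g. 'VER:ind+pres+3+s'
        crs[form].add(coarse(tag))   # e.g. 'VER'
    return ({w: len(tags) for w, tags in fine.items()},
            {w: len(tags) for w, tags in crs.items()})
```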
5 Conclusions

The main conclusion we draw from the experiments in this paper is that data selection matters, not only for training but also during development, in order to take informed decisions. Indeed, only after creating a carefully designed internal development set did we obtain stronger evidence of the contribution of silver data, which is also reflected in the official results. We also observe that choosing less but more targeted data is more effective. For instance, TWITA embeddings contribute more than generic POLY embeddings, which were trained on substantially larger amounts of Wikipedia data. Also, just blindly adding training data does not help. We have seen that using the whole of the UD corpus is not beneficial to performance when compared to a small amount of selected gold data, both in terms of origin and of labels covered. Finally, and most importantly, we have found that adding small amounts of not-so-distant silver data obtained via bootstrapping resulted in our best model.

We believe the low performance observed when adding xLIME data is likely due to the non-correspondence of tags in the two datasets, which required a heuristic-based mapping. While this is only a speculation that requires further investigation, it seems to indicate that exploring semi-supervised strategies is preferable to producing idiosyncratic or project-specific gold annotations.

Acknowledgments

We thank the CIT of the University of Groningen for providing access to the Peregrine HPC cluster. Barbara Plank acknowledges NVIDIA Corporation for support.

References

Steven Abney. 2007. Semisupervised Learning for Computational Linguistics. CRC Press.

Željko Agić, Anders Johannsen, Barbara Plank, Héctor Martínez Alonso, Natalie Schluter, and Anders Søgaard. 2016. Multilingual projection for parsing truly low-resource languages. Transactions of the Association for Computational Linguistics (TACL), 4:301–312.

Rami Al-Rfou, Bryan Perozzi, and Steven Skiena. 2013. Polyglot: Distributed word representations for multilingual NLP. arXiv preprint arXiv:1307.1662.

Miguel Ballesteros, Chris Dyer, and Noah A. Smith. 2015. Improved transition-based parsing by modeling characters instead of words with LSTMs. In EMNLP.

Valerio Basile and Malvina Nissim. 2013. Sentiment analysis on Italian tweets. In Proceedings of the 4th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, pages 100–107.

Antoine Bordes, Xavier Glorot, Jason Weston, and Yoshua Bengio. 2012. Joint learning of words and meaning representations for open-text semantic parsing. In AISTATS, volume 351, pages 423–424.

Thorsten Brants. 2000. TnT: a statistical part-of-speech tagger. In ANLP.

Rich Caruana. 1997. Multitask learning. Machine Learning, 28:41–75.

Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12(Aug):2493–2537.

Jacob Eisenstein. 2013. What to do about bad language on the internet. In Proceedings of the Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), pages 359–369, Atlanta.

Jennifer Foster, Ozlem Cetinoglu, Joachim Wagner, Josef Le Roux, Joakim Nivre, Deirdre Hogan, and Josef van Genabith. 2011. From news to comments: Resources and benchmarks for parsing the language of Web 2.0. In IJCNLP.

Kevin Gimpel, Nathan Schneider, Brendan O'Connor, Dipanjan Das, Daniel Mills, Jacob Eisenstein, Michael Heilman, Dani Yogatama, Jeffrey Flanigan, and Noah A. Smith. 2011. Part-of-speech tagging for Twitter: Annotation, features, and experiments. In Proceedings of ACL.

Dirk Hovy, Taylor Berg-Kirkpatrick, Ashish Vaswani, and Eduard Hovy. 2013. Learning whom to trust with MACE. In NAACL.

Xiaodong Liu, Jianfeng Gao, Xiaodong He, Li Deng, Kevin Duh, and Ye-Yi Wang. 2015. Representation learning using multi-task deep neural networks for semantic classification and information retrieval. In Proceedings of NAACL.

David McClosky, Eugene Charniak, and Mark Johnson. 2010. Automatic domain adaptation for parsing. In NAACL-HLT.

T. Mikolov and J. Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems.

Joakim Nivre et al. 2016. Universal Dependencies 1.3. LINDAT/CLARIN digital library at the Institute of Formal and Applied Linguistics, Charles University in Prague.

Barbara Plank, Anders Søgaard, and Yoav Goldberg. 2016. Multilingual part-of-speech tagging with bidirectional long short-term memory models and auxiliary loss. In ACL.

Barbara Plank. 2016. What to do about non-standard (or non-canonical) language in NLP. In KONVENS.

Luis Rei, Dunja Mladenic, and Simon Krek. 2016. A multilingual social media linguistic corpus. In Conference of CMC and Social Media Corpora for the Humanities.

Fabio Tamburini, Cristina Bosco, Alessandro Mazzei, and Andrea Bolioli. 2016. Overview of the EVALITA 2016 Part Of Speech on TWitter for ITAlian task. In Pierpaolo Basile, Anna Corazza, Franco Cutugno, Simonetta Montemagni, Malvina Nissim, Viviana Patti, Giovanni Semeraro, and Rachele Sprugnoli, editors, Proceedings of the Third Italian Conference on Computational Linguistics (CLiC-it 2016) & Fifth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2016). Associazione Italiana di Linguistica Computazionale (AILC).

Eros Zanchetta and Marco Baroni. 2005. Morph-it! A free corpus-based morphological resource for the Italian language. Corpus Linguistics 2005, 1(1).