When silver glitters more than gold: Bootstrapping an Italian part-of-speech tagger for Twitter

Barbara Plank, Malvina Nissim
University of Groningen, The Netherlands
b.plank@rug.nl, m.nissim@rug.nl

Abstract

English. We bootstrap a state-of-the-art part-of-speech tagger to tag Italian Twitter data, in the context of the Evalita 2016 PoSTWITA shared task. We show that training the tagger on native Twitter data enriched with small amounts of specifically selected gold data and additional silver-labelled data scraped from Facebook yields better results than using large amounts of manually annotated data from a mix of genres.

Italiano (translated). In the context of the Evalita 2016 PoSTWITA evaluation campaign, we train two models that differ in the degree of supervision used at training time. The model trained with two bootstrapping cycles over Facebook posts, which therefore also learns from "silver" labels, outperforms the supervised version that uses only manually annotated data. We discuss the importance of the choice of training and development data.

1 Introduction

The emergence and abundance of social media texts has prompted the urge to develop tools that are able to process language which is often non-conventional, in terms of both lexicon and grammar. Indeed, models trained on standard newswire data suffer heavily when used on data from a different language variety, especially Twitter (McClosky et al., 2010; Foster et al., 2011; Gimpel et al., 2011; Plank, 2016).

As a way to equip microblog processing with efficient tools, two ways of developing Twitter-compliant models have been explored. One option is to transform Twitter language back to what pre-trained models already know via normalisation operations, so that existing tools are more successful on such different data. The other option is to create native models by training them on labelled Twitter data. The drawback of the first option is that it is not clear what norm to target: "what is standard language?" (Eisenstein, 2013; Plank, 2016), and implementing normalisation procedures requires quite a lot of manual intervention and subjective decisions. The drawback of the second option is that manually annotated Twitter data is not readily available, and it is costly to produce.

In this paper, we report on our participation in PoSTWITA[1], the EVALITA 2016 shared task on Italian Part-of-Speech (POS) tagging for Twitter (Tamburini et al., 2016). We emphasise an approach geared to building a single model (rather than an ensemble) based on weakly supervised learning, thus favouring, over normalisation, the aforementioned second option of learning invariant representations, also for theoretical reasons. We address the bottleneck of acquiring manually annotated data by suggesting, and showing, that a semi-supervised approach that mainly focuses on tweaking data selection within a bootstrapping setting can be successfully pursued for this task. Contextually, we show that large amounts of manually annotated data might not be helpful if the data is not "of the right kind".

[1] http://corpora.ficlit.unibo.it/PoSTWITA/

2 Data selection and bootstrapping

In adapting a POS tagger to Twitter, we mainly focus on ways of selectively enriching the training set with additional data. Rather than simply adding large amounts of existing annotated data, we investigate ways of selecting smaller amounts of more appropriate training instances, possibly even tagged with silver rather than gold labels. As for the model itself, we simply take an off-the-shelf tagger, namely a bi-directional Long Short-Term Memory (bi-LSTM) model (Plank et al., 2016), which we use with default parameters (see Section 3.2) apart from initialising it with Twitter-trained embeddings (Section 3.1).

Our first model is trained on the PoSTWITA training set plus additional gold data selected according to two criteria (see below: Two shades of gold). This model is used to tag a collection of Facebook posts in a bootstrapping setting with two cycles (see below: Bootstrapping via Facebook). The rationale behind using Facebook as a not-so-distant source when targeting Twitter is the following: many Facebook posts of public, non-personal pages resemble tweets in style, because of their brevity and use of hashtags. However, differently from random tweets, they are usually correctly formed grammatically and spelling-wise, and often provide more context, which allows for more accurate tagging.

Two shades of gold. We used the Italian portion of the latest release (v1.3) of the Universal Dependencies (UD) dataset (Nivre et al., 2016), from which we extracted two subsets, according to two different criteria. First, we selected data on the basis of its origin, trying to match the Twitter training data as closely as possible. For this reason, we used the Facebook subportion (UD_FB). These are 45 sentences that presumably stem from the Italian Facebook help pages and contain questions and short answers.[2] Second, by looking at the confusion matrix of one of the initial models, we saw that the model's performance was especially poor on cliticised verbs and interjections, tags that are also infrequent in the training set (Table 2). Therefore, from the Italian UD portion we selected any data (in terms of origin/genre) which contained the VERB_CLIT or INTJ tag, with the aim of boosting the identification of these categories. We refer to this set of 933 sentences as UD_verb_clit+intj.

[2] These are labelled as 4-FB in the comment section of UD. Examples include: "Prima di effettuare la registrazione."; "È vero che Facebook sarà a pagamento?"

Bootstrapping via Facebook. We augmented our training set with silver-labelled data. With our best model trained on the original task data plus UD_verb_clit+intj and UD_FB, we tagged a collection of Facebook posts, added those to the training pool, and retrained our tagger. We used two iterations of indelible self-training (Abney, 2007), i.e., adding automatically tagged data whose labels do not change once added. Using the Facebook API through the facebook-sdk Python library[3], we scraped an average of 100 posts for each of the following pages, selected on the basis of our intuition and of reasonable page popularity:

• sport: corrieredellosport
• news: Ansa.it, ilsole24ore, lastampa.it
• politics: matteorenziufficiale
• entertainment: novella2000, alFemminile
• travel: viaggiart

[3] https://pypi.python.org/pypi/facebook-sdk

We included a second cycle of bootstrapping, scraping a few more Facebook pages (soloGossip.it, paesionline, espressonline, LaGazzettaDelloSport, again with an average of 100 posts each) and tagging the posts with the model that had been re-trained on the original training set plus the first round of Facebook data with silver labels (we refer to the whole of the automatically labelled Facebook data as FB_silver). FB_silver was added to the training pool to train the final model; the overall loop is sketched below. Statistics on the obtained data are given in Table 1.[4]

[4] Due to time constraints we did not add further iterations; we cannot judge whether we had already reached a performance plateau.

Table 1: Statistics on the additional datasets.

  Data                 Type          Sents   Tokens
  UD_FB                gold             45      580
  UD_verb_clit+intj    gold            933      26k
  FB (all, iter 1)     silver         2243      37k
  FB (all, iter 2)     silver         3071      47k
  Total added data     gold+silver    4049      74k
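For concreteness, the two-cycle loop can be summarised in a short sketch. This is a minimal illustration, not the actual experiment scripts: train_tagger, tag_posts and scrape_page are hypothetical callables standing in for bilty training/prediction and for facebook-sdk page scraping.

```python
def bootstrap(gold_data, page_batches, train_tagger, tag_posts, scrape_page,
              posts_per_page=100):
    """Indelible self-training (Abney, 2007): automatically tagged
    (silver) posts are added to the training pool and their labels
    are never revised afterwards."""
    train_pool = list(gold_data)              # PoSTWITA + selected UD gold
    model = train_tagger(train_pool)
    for pages in page_batches:                # one batch of pages per cycle
        posts = [p for page in pages
                 for p in scrape_page(page, limit=posts_per_page)]
        silver = tag_posts(model, posts)      # silver-labelled data
        train_pool.extend(silver)             # indelible: labels are frozen
        model = train_tagger(train_pool)      # retrain on gold + silver
    return model
```

With the page lists above, the first cycle adds silver data from eight pages and the second from four more, after which the final model is trained on the full gold+silver pool.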
3 Experiments and Results

In this section we describe how we developed the two models of the final submission, including all preprocessing decisions. We highlight the importance of choosing an adequate development set to identify promising directions.

3.1 Experimental Setup

PoSTWITA data. In the context of PoSTWITA, training data was provided to all participants in the form of manually labelled tweets. The tags comply with the UD tagset, with a couple of modifications due to the specific genre (emoticons are labelled with a dedicated tag, for example), and subjective choices in the treatment of some morphological traits typical of Italian. Specifically, clitics and articulated prepositions are treated as one single form (see below: UD fused forms). The training set contains 6438 tweets, for a total of ca. 115K tokens. The distribution of tags together with examples is given in Table 2. The test set comprises 301 tweets (ca. 4800 tokens).

Table 2: Tag distribution in the original trainset.

  Tag        Explanation          #Tokens   Example
  NOUN       noun                   16378   cittadini
  PUNCT      punctuation            14513   ?
  VERB       verb                   12380   apprezzo
  PROPN      proper noun            11092   Ancona
  DET        determiner              8955   il
  ADP        preposition             8145   per
  ADV        adverb                  6041   sempre
  PRON       pronoun                 5656   quello
  ADJ        adjective               5494   mondiale
  HASHTAG    hashtag                 5395   #manovra
  ADP_A      articulated prep        4465   nella
  CONJ       coordinating conj       2876   ma
  MENTION    mention                 2592   @InArteMorgan
  AUX        auxiliary verb          2273   potrebbe
  URL        url                     2141   http://t.co/La3opKcp
  SCONJ      subordinating conj      1521   quando
  INTJ       interjection            1404   fanculo
  NUM        number                  1357   23%
  X          anything else            776   s...
  EMO        emoticon                 637
  VERB_CLIT  verb+clitic              539   vergognarsi
  SYM        symbol                   334   →
  PART       particle                   3   's

UD fused forms. In the UD scheme for Italian, articulated prepositions (ADP_A) and cliticised verbs (VERB_CLIT) are annotated as separate word forms, while in PoSTWITA the original word form (e.g., 'alla' or 'arricchirsi') is annotated as a whole. In order to obtain the PoSTWITA ADP_A and VERB_CLIT tags for these fused word forms from UD, we adapted the UCPH ud-conversion-tools[5] (Agić et al., 2016), which propagate head POS information up to the original form.

[5] https://github.com/coastalcph/ud-conversion-tools
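The conversion essentially collapses UD multiword token ranges back into single tagged forms. The following is a simplified sketch of that step under our reading of the head-propagation idea, not the adapted ud-conversion-tools code; the tag-mapping heuristic is our own illustration.

```python
import re

def fused_tag(sub_tags):
    """Map the UPOS tags of a multiword token's parts to the fused
    PoSTWITA tag (simplified stand-in for head-POS propagation)."""
    if sub_tags[0] == "ADP" and "DET" in sub_tags[1:]:
        return "ADP_A"          # e.g. 'alla' = a (ADP) + la (DET)
    if "VERB" in sub_tags and "PRON" in sub_tags:
        return "VERB_CLIT"      # e.g. 'arricchirsi' = arricchire + si
    return sub_tags[0]          # fall back to the first part's tag

def merge_multiword_tokens(conllu_lines):
    """Yield (form, tag) pairs with UD multiword ranges collapsed."""
    rows = [l.split("\t") for l in conllu_lines if l.strip() and l[0] != "#"]
    i = 0
    while i < len(rows):
        m = re.match(r"(\d+)-(\d+)$", rows[i][0])   # range line, e.g. "2-3"
        if m:
            span = int(m.group(2)) - int(m.group(1)) + 1
            sub_tags = [r[3] for r in rows[i + 1:i + 1 + span]]  # UPOS col
            yield rows[i][1], fused_tag(sub_tags)   # fused surface form
            i += 1 + span
        else:
            yield rows[i][1], rows[i][3]
            i += 1
```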
Pre-processing of unlabelled data. For the Facebook data, we use a simplistic off-the-shelf rule-based tokeniser that segments sentences by punctuation and tokens by whitespace.[6] We normalise URLs to a single token (http://www.someurl.org) and add a rule for smileys. Finally, we remove sentences from the Facebook data where more than 90% of the tokens are in all caps. Unlabelled data used for embeddings is preprocessed only with normalisation of usernames and URLs.

[6] https://github.com/bplank/multilingualtokenizer

Word Embeddings. We induced word embeddings from 5 million Italian tweets from TWITA (Basile and Nissim, 2013). Vectors were created using word2vec (Mikolov and Dean, 2013) with default parameters, except that we set the dimensions to 64, to match the vector size of the multilingual (POLY) embeddings (Al-Rfou et al., 2013) used by Plank et al. (2016). We dealt with unknown words by adding a "UNK" token, computed as the mean vector of three infrequent words ("vip!", "cuora", "White").
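A compact sketch of these two steps follows, using gensim as a stand-in for the original word2vec tool (the paper does not state which wrapper was used); the all-caps threshold and the three infrequent words are taken from the text above.

```python
import re
import numpy as np
from gensim.models import Word2Vec

URL_RE = re.compile(r"https?://\S+")

def normalise(tokens):
    """Collapse every URL into the single placeholder token."""
    return [URL_RE.sub("http://www.someurl.org", t) for t in tokens]

def mostly_caps(tokens, threshold=0.9):
    """Facebook filter rule: flag sentences in which more than 90% of
    the alphabetic tokens are in all caps."""
    alpha = [t for t in tokens if t.isalpha()]
    return bool(alpha) and sum(t.isupper() for t in alpha) / len(alpha) > threshold

def train_twita_embeddings(tweets):
    """tweets: pre-tokenised tweets with usernames/URLs normalised.
    Dimension 64 matches the POLY vectors; gensim >= 4 names the
    parameter vector_size (older versions call it size)."""
    model = Word2Vec([normalise(t) for t in tweets], vector_size=64)
    # "UNK" vector: mean of the three infrequent words named in the paper.
    rare = [w for w in ("vip!", "cuora", "White") if w in model.wv]
    return model, np.mean([model.wv[w] for w in rare], axis=0)
```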
Creation of a realistic internal development set. The original task data is distributed as a single training file. In initial experiments we saw that performance varied considerably across different random subsets. This was due to a large bias towards tweets about 'Monti' and 'Grillo' (see Figure 1), but also to duplicate tweets. We opted to create the most difficult development set possible. This development set was obtained by removing duplicates and randomly selecting a subset of tweets that do not mention 'Grillo' or 'Monti', while maximizing the out-of-vocabulary (OOV) rate with respect to the training data; one plausible implementation is sketched below. Our internal development set hence consisted of 700 tweets with an OOV rate approaching 50%. This represents a more realistic testing scenario. Indeed, the baseline (the basic bi-LSTM model) dropped to 92.41 from the 94.37 computed on the earlier development set, where we had randomly selected 1/5 of the data, with an OOV rate of 45% (see Table 4).

[Figure 1: Word cloud from the training data.]
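Since the paper does not spell out the exact sampling procedure, the following is one plausible greedy reading of "remove duplicates, exclude the two entities, maximise OOV"; all names are ours, and sampling randomly among high-OOV candidates would be an equally plausible variant.

```python
def build_dev_set(tweets, train_vocab, n=700, banned=("grillo", "monti")):
    """Build a deliberately hard development set: deduplicate, drop
    tweets mentioning the two over-represented entities, and keep the
    n tweets with the highest OOV rate w.r.t. the training vocabulary.
    A plausible reconstruction, not the authors' actual script."""
    seen, scored = set(), []
    for toks in tweets:                      # toks: one tokenised tweet
        if not toks:
            continue
        key = " ".join(toks).lower()
        if key in seen or any(b in key for b in banned):
            continue                         # duplicate or banned entity
        seen.add(key)
        oov = sum(t.lower() not in train_vocab for t in toks) / len(toks)
        scored.append((oov, toks))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [toks for _, toks in scored[:n]]  # hardest n tweets
```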
3.2 Model

The bidirectional Long Short-Term Memory model bilty[7] is illustrated in Figure 2. It is a context bi-LSTM taking as input word embeddings w. Character embeddings c are incorporated via a hierarchical bi-LSTM, using a sequence bi-LSTM at the lower level (Ballesteros et al., 2015; Plank et al., 2016). The character representation is concatenated with the (learned) word embedding w to form the input to the context bi-LSTM at the upper layers. We took default parameters, i.e., character embeddings of size 100, word embeddings of size 64, 20 iterations of training using Stochastic Gradient Descent, a single bi-LSTM layer, and regularisation using Gaussian noise with σ = 0.2 (cdim 100, trainer sgd, indim 64, iters 20, h_layer 1, sigma 0.2). The model has been shown to achieve state-of-the-art performance on a range of languages, where the incorporation of character information was particularly effective (Plank et al., 2016). With these features and settings we train two models on different training sets; the two models are described after the architecture sketch below.

[7] https://github.com/bplank/bilstm-aux

[Figure 2: Hierarchical bi-LSTM model using word w and character c representations.]
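To make the architecture concrete, here is a minimal re-implementation sketch of the hierarchical word+character encoder in PyTorch. It is an illustrative stand-in, not the original bilty code: dimensions follow the defaults above, while the Gaussian-noise regularisation and the SGD training loop are omitted.

```python
import torch
import torch.nn as nn

class HierarchicalTagger(nn.Module):
    """Sketch of the bilty idea: a character bi-LSTM whose final states
    are concatenated with the word embedding, feeding a context bi-LSTM
    over the sentence, followed by a per-token tag scorer."""
    def __init__(self, n_words, n_chars, n_tags, wdim=64, cdim=100):
        super().__init__()
        self.wemb = nn.Embedding(n_words, wdim)
        self.cemb = nn.Embedding(n_chars, cdim)
        self.char_lstm = nn.LSTM(cdim, cdim, bidirectional=True)
        self.ctx_lstm = nn.LSTM(wdim + 2 * cdim, 100, bidirectional=True)
        self.out = nn.Linear(200, n_tags)

    def forward(self, word_ids, char_ids_per_word):
        reps = []
        for wid, cids in zip(word_ids, char_ids_per_word):
            chars = self.cemb(cids).unsqueeze(1)      # (len, 1, cdim)
            _, (h, _) = self.char_lstm(chars)         # h: (2, 1, cdim)
            cvec = torch.cat([h[0, 0], h[1, 0]])      # fwd+bwd final states
            reps.append(torch.cat([self.wemb(wid), cvec]))
        seq = torch.stack(reps).unsqueeze(1)          # (n, 1, wdim + 2*cdim)
        ctx, _ = self.ctx_lstm(seq)                   # (n, 1, 200)
        return self.out(ctx.squeeze(1))               # (n, n_tags) tag scores

# toy usage: a three-word sentence with its character ids
# tagger = HierarchicalTagger(n_words=100, n_chars=50, n_tags=22)
# scores = tagger(torch.tensor([1, 2, 3]),
#                 [torch.tensor([4, 5]), torch.tensor([6]),
#                  torch.tensor([7, 8, 9])])
```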
GOLDPICK: bilty with pre-initialised TWITA embeddings, trained on the PoSTWITA training set plus the selected gold data (UD_FB + UD_verb_clit+intj).

SILVERBOOT: a bootstrapped version of GOLDPICK, where FB_silver (see Section 2) is also added to the training pool, which thus includes both gold and silver data.

3.3 Results on test data

Participants were allowed to submit one official and one additional (unofficial) run. Because SILVERBOOT performed better than GOLDPICK on development data, we selected the former for our official submission and the latter for the unofficial one, making it thus also possible to assess the specific contribution of bootstrapping to performance.

Table 3 shows the results on the official test data for both our models and TnT (Brants, 2000). The results show that adding bootstrapped silver data outperforms the model trained on gold data alone. The additional training data included in SILVERBOOT reduced the OOV rate on the test set to 41.2% (compared to 46.9% with respect to the original PoSTWITA training set). Note that on the original, randomly selected development set the results were less indicative of the contribution of the silver data (see Table 4), showing the importance of a carefully selected development set.

Table 3: Results on the official test set. BEST is the highest performing system at PoSTWITA.

  System                       Accuracy
  BEST                            93.19
  SILVERBOOT (official)           92.25
  GOLDPICK (unofficial)           91.85
  TnT (on PoSTWITA train)         84.83
  TnT (on SILVERBOOT data)        85.52

Table 4: Results on internal development set.

  System                                Accuracy
  Internal dev (prior), OOV: 45%
  BASELINE (w/o emb)                       94.37
  +POLY emb                                94.15
  +TWITA emb                               94.69
  BASELINE+TWITA emb
  +Morphit! coarse MTL                     94.61
  +Morphit! fine MTL                       94.68
  +UD all                                  94.88
  +gold-picked                             95.06
  +gold-picked+silver (1st round)          95.08
  Internal dev (realistic), OOV: 50%
  BASELINE (incl. TWITA emb)               92.41
  +gold (GOLDPICK)                         93.19
  +gold+silver (SILVERBOOT)                93.42
  adding more gold (Twitter) data:
  +xLIME ADJUDICATED (48)                  92.58
  +xLIME SINGLE ANNOT.                     91.67
  +xLIME ALL (8k)                          92.04

4 What didn't work

In addition to what we found to boost the tagger's performance, we also observed what did not yield any improvements, and in some cases even lowered global accuracy. What we experimented with was triggered by intuition and previous work, as well as by what we had already found to be successful, such as selecting additional data to make up for under-represented tags in the training set. However, everything we report in this section turned out to be either pointless or detrimental.

More data. We added to the training data all (train, development, and test) sections from the Italian part of UD 1.3. While training on the selected gold data (978 sentences) yielded 95.06% accuracy, adding all of the UD data (12k sentences of newswire, legal and wiki texts) yielded a disappointing 94.88% in initial experiments (see Table 4), while also considerably slowing down training.

Next, we tried to add more Twitter data from xLIME, a publicly available corpus with multiple layers of manually assigned labels, including POS tags, for a total of ca. 8600 tweets and 160K tokens (Rei et al., 2016). The data is not provided as a single gold-standard file but in the form of separate annotations produced by different judges, so we used MACE (Hovy et al., 2013) to adjudicate divergences. Additionally, the tagset is slightly different from the UD set, so we had to implement a mapping. The results in Table 4 show that adding all of the xLIME data degrades performance, despite careful preprocessing to map the tags and resolve annotation divergences.

More tag-specific data. From the confusion matrix computed on the dev set, it emerged that the most confused categories were NOUN and PROPN. Following the same principle that led us to add UD_verb_clit+intj, we tried to reduce such confusion by providing additional training data containing proper nouns. This did not yield any improvements, neither in terms of global accuracy nor in terms of precision and recall for the two tags.

Multi-task learning. Multi-task learning (MTL) (Caruana, 1997), namely a learning setting where more than one task is learnt at the same time, has been shown to improve performance on several NLP tasks (Collobert et al., 2011; Bordes et al., 2012; Liu et al., 2015). Often, what is learnt is one main task and, additionally, a number of auxiliary tasks, where the latter should help the model converge better and overfit less on the former. In this context, the additional signal we used to support the learning of each token's POS tag is the token's degree of ambiguity. Using the information stored in Morph-it!, a lexicon of Italian inflected forms with their lemma and morphological features (Zanchetta and Baroni, 2005), we obtained the number of different tags potentially associated with each token. Because the Morph-it! labels are highly fine-grained, we derived two different ambiguity scores, one on the original and one on coarser tags. In neither case did the additional signal contribute to the tagger's performance, but we have not explored this direction fully and leave it for future investigation.
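As an illustration of how such an auxiliary signal can be derived, here is a sketch that assumes Morph-it!'s tab-separated form/lemma/tag entry layout; the ":"-based coarsening is our assumption about a reasonable cut, not the paper's exact mapping.

```python
from collections import defaultdict

def ambiguity_scores(morphit_entries, coarse=lambda t: t.split(":")[0]):
    """Derive per-token tag-ambiguity counts from Morph-it! entries,
    i.e. (form, lemma, fine-grained tag) triples. Returns a fine score
    (distinct fine-grained tags per form) and a coarse score (distinct
    coarsened tags per form)."""
    fine, crs = defaultdict(set), defaultdict(set)
    for form, _lemma, tag in morphit_entries:
        fine[form].add(tag)          # e.g. 'VER:ind+pres+3+s'
        crs[form].add(coarse(tag))   # e.g. 'VER'
    return ({w: len(tags) for w, tags in fine.items()},
            {w: len(tags) for w, tags in crs.items()})
```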
5 Conclusions

The main conclusion we draw from the experiments in this paper is that data selection matters, not only for training but also during development, in order to take informed decisions. Indeed, only after creating a carefully designed internal development set did we obtain stronger evidence of the contribution of silver data, which is also reflected in the official results. We also observe that choosing less but more targeted data is more effective. For instance, TWITA embeddings contribute more than generic POLY embeddings, which were trained on substantially larger amounts of Wikipedia data. Also, just blindly adding training data does not help. We have seen that using the whole of the UD corpus is not beneficial to performance when compared to a small amount of selected gold data, both in terms of origin and of labels covered. Finally, and most importantly, we have found that adding small amounts of not-so-distant silver data obtained via bootstrapping resulted in our best model.

We believe the low performance observed when adding xLIME data is likely due to the non-correspondence of tags in the two datasets, which required a heuristic-based mapping. While this is only a speculation that requires further investigation, it seems to indicate that exploring semi-supervised strategies is preferable to producing idiosyncratic or project-specific gold annotations.

Acknowledgments

We thank the CIT of the University of Groningen for providing access to the Peregrine HPC cluster. Barbara Plank acknowledges NVIDIA Corporation for support.

References

Steven Abney. 2007. Semisupervised Learning for Computational Linguistics. CRC Press.

Željko Agić, Anders Johannsen, Barbara Plank, Héctor Martínez Alonso, Natalie Schluter, and Anders Søgaard. 2016. Multilingual projection for parsing truly low-resource languages. Transactions of the Association for Computational Linguistics (TACL), 4:301–312.

Rami Al-Rfou, Bryan Perozzi, and Steven Skiena. 2013. Polyglot: Distributed word representations for multilingual NLP. arXiv preprint arXiv:1307.1662.

Miguel Ballesteros, Chris Dyer, and Noah A. Smith. 2015. Improved transition-based parsing by modeling characters instead of words with LSTMs. In EMNLP.

Valerio Basile and Malvina Nissim. 2013. Sentiment analysis on Italian tweets. In Proceedings of the 4th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, pages 100–107.

Antoine Bordes, Xavier Glorot, Jason Weston, and Yoshua Bengio. 2012. Joint learning of words and meaning representations for open-text semantic parsing. In AISTATS, volume 351, pages 423–424.

Thorsten Brants. 2000. TnT: a statistical part-of-speech tagger. In ANLP.

Rich Caruana. 1997. Multitask learning. Machine Learning, 28:41–75.

Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12(Aug):2493–2537.

Jacob Eisenstein. 2013. What to do about bad language on the internet. In Proceedings of the Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), pages 359–369, Atlanta.

Jennifer Foster, Ozlem Cetinoglu, Joachim Wagner, Josef Le Roux, Joakim Nivre, Deirdre Hogan, and Josef van Genabith. 2011. From news to comments: Resources and benchmarks for parsing the language of Web 2.0. In IJCNLP.

Kevin Gimpel, Nathan Schneider, Brendan O'Connor, Dipanjan Das, Daniel Mills, Jacob Eisenstein, Michael Heilman, Dani Yogatama, Jeffrey Flanigan, and Noah A. Smith. 2011. Part-of-speech tagging for Twitter: Annotation, features, and experiments. In Proceedings of ACL.

Dirk Hovy, Taylor Berg-Kirkpatrick, Ashish Vaswani, and Eduard Hovy. 2013. Learning whom to trust with MACE. In NAACL.

Xiaodong Liu, Jianfeng Gao, Xiaodong He, Li Deng, Kevin Duh, and Ye-Yi Wang. 2015. Representation learning using multi-task deep neural networks for semantic classification and information retrieval. In Proceedings of NAACL.

David McClosky, Eugene Charniak, and Mark Johnson. 2010. Automatic domain adaptation for parsing. In NAACL-HLT.

T. Mikolov and J. Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems.

Joakim Nivre et al. 2016. Universal Dependencies 1.3. LINDAT/CLARIN digital library at the Institute of Formal and Applied Linguistics, Charles University in Prague.

Barbara Plank, Anders Søgaard, and Yoav Goldberg. 2016. Multilingual part-of-speech tagging with bidirectional long short-term memory models and auxiliary loss. In ACL.

Barbara Plank. 2016. What to do about non-standard (or non-canonical) language in NLP. In KONVENS.

Luis Rei, Dunja Mladenic, and Simon Krek. 2016. A multilingual social media linguistic corpus. In Conference of CMC and Social Media Corpora for the Humanities.

Fabio Tamburini, Cristina Bosco, Alessandro Mazzei, and Andrea Bolioli. 2016. Overview of the EVALITA 2016 Part Of Speech on TWitter for ITAlian task. In Pierpaolo Basile, Anna Corazza, Franco Cutugno, Simonetta Montemagni, Malvina Nissim, Viviana Patti, Giovanni Semeraro, and Rachele Sprugnoli, editors, Proceedings of the Third Italian Conference on Computational Linguistics (CLiC-it 2016) & Fifth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2016). Associazione Italiana di Linguistica Computazionale (AILC).

Eros Zanchetta and Marco Baroni. 2005. Morph-it! A free corpus-based morphological resource for the Italian language. Corpus Linguistics 2005, 1(1).