Character Embeddings PoS Tagger vs HMM Tagger for Tweets

Giuseppe Attardi, Maria Simi
Dipartimento di Informatica, Università di Pisa
Largo B. Pontecorvo, 3
I-56127 Pisa, Italy
{attardi, simi}@di.unipi.it

Abstract

English. The paper describes our submissions to the task on PoS tagging for Italian Social Media Texts (PoSTWITA) at Evalita 2016. We compared two approaches: a traditional HMM trigram PoS tagger and a Deep Learning PoS tagger using both character-level and word-level embeddings. The character-level embeddings performed better, proving that they can provide a finer representation of words that allows coping with the idiosyncrasies and irregularities of the language in microposts.

Italiano. This paper describes our participation in the PoS tagging for Italian Social Media Texts (PoSTWITA) task at Evalita 2016. We compared two approaches: a traditional PoS tagger based on trigram HMMs and a Deep Learning PoS tagger that uses embeddings at both the character and the word level. The character embeddings achieved a better result, showing that they can provide a finer representation of words which makes it possible to handle the idiosyncrasies and irregularities of the language used in microposts.

1 Introduction

The PoS tagging challenge at Evalita 2016 was targeted to the analysis of Italian micropost language, in particular the language of Twitter posts. The organizers provided an annotated training corpus, obtained by annotating a collection of Italian tweets from the earlier Evalita 2014 SENTIPOLC corpus. The annotations follow the guidelines proposed by the Universal Dependencies (UD) project for Italian [1], in particular with respect to tokenization and tag set, with minor changes due to the specificity of the text genre. A few specific tags (EMO, URL, EMAIL, HASHTAG and MENTION) have in fact been added for typical morphological categories in social media texts, such as emoticons and emojis, web URLs, email addresses, hashtags and mentions.

[1] http://universaldependencies.org/it/pos/index.html

The challenge for PoS tagging of microposts consists in dealing with misspelled, colloquial or broken words, as well as in overcoming the lack of context and of proper uppercasing, which provide helpful hints when analysing more standard texts.

We conducted preparatory work that consisted in customizing some available lexical and training resources for the task: sections 2 and 3 describe this process.

We decided to address the research question of comparing the relative performance of two different approaches to PoS tagging: the traditional word-based approach, represented by a Hidden Markov Model PoS tagger, and a Deep Learning approach that exploits character-level embeddings (Ma and Hovy, 2016). Sections 4 and 5 describe the two approaches in detail.

2 Building a larger training resource

The gold training set provided for the task consists in a collection of 6,640 Italian tweets from the Evalita 2014 SENTIPOLC corpus (corresponding to 127,843 word tokens). Given the relatively small size of this resource, we extended it by leveraging existing resources. We used the corpus previously employed in the organization of the Evalita 2009 task on PoS Tagging (Attardi and Simi, 2009), consisting in articles from the newspaper "La Repubblica", some articles from the Italian Wikipedia, portions of the Universal Dependencies Italian corpus and a small collection of annotated Italian tweets. Table 1 provides details of the composition of the training resource.

  Resource          Number of tokens
  repubblica.pos             112,593
  extra.pos                      130
  quest.pos                    9,826
  isst_tanl.pos               80,794
  tut.pos                     97,558
  it-twitter.pos               1,018
  Evalita 2016               121,405
  Total                      423,324

  Table 1. Composition of the training set.

The tag set was converted to the Universal Dependencies schema, taking into account the variants introduced in the task (different tokenization of articulated prepositions and introduction of ADP_A).

During development, the gold dataset provided by the organizers was split into two parts: a subset of about 105,300 tokens was used for training, while the remaining ~22,500 tokens were used as a validation set.

3 Normalization of URLs, emoticons and emojis

In order to facilitate the tagging of the morphological categories specifically introduced for social media texts, we applied a pre-processing step that normalizes word forms. This was done by means of a set of rewriting rules based on regular expressions.

These rules are quite straightforward for URLs, hashtags, emails and mentions, while the identification of emoticons and emojis required a set of carefully handcrafted rules, because of their variety and higher degree of ambiguity.

4 The traditional approach: the TANL tagger

Linear statistical models, such as Hidden Markov Models (HMM) or Conditional Random Fields (CRF), are often used for sequence labeling (PoS tagging and NER). In our first experiment we used the Tanl PoS tagger, based on a second order HMM.

The Tanl PoS tagger is derived from a rewriting in C++ of HunPos (Halácsy et al., 2007), an open source trigram tagger written in OCaml. The tagger estimates the probability of a sequence of labels t1…tT for a sequence of words w1…wT from the probabilities of trigrams:

  argmax over t1…tT of  ∏ P(wi | ti) P(ti | ti-1, ti-2)

The trigram probabilities are estimated by smoothing, through linear interpolation, the probabilities of unigrams, bigrams and trigrams:

  P(ti | ti-1, ti-2) = λ1 P̂(ti) + λ2 P̂(ti | ti-1) + λ3 P̂(ti | ti-1, ti-2)

where the P̂ are maximum likelihood estimates and λ1 + λ2 + λ3 = 1.

An approximate Viterbi algorithm is used for finding the sequence of tags with the highest probability; it exploits beam search to prune unlikely alternative paths.

The tagger uses a suffix guessing algorithm for dealing with unseen words. It computes the probability distribution of tags for each suffix, by building a trie from the suffixes, up to a maximum length (default 10), of words appearing less than n (default 10) times in the training corpus. Actually two suffix tries are built: one for words beginning with uppercase, one for lowercase words. A word at the beginning of a sentence is looked up in its lowercase variant. Special handling is provided for numbers and HTML entities.

The tagger can also be given a file with a list of possible tags and lemmas for each word, in order to initialize its lexicon. In our experiments we used a lexicon of 130 thousand Italian words.

5 Character-level Embeddings

Traditional techniques of statistical machine learning usually require, to perform best, task-specific selection and tuning of handcrafted features, as well as resources like lexicons or gazetteers, which are costly to develop.

Recently, end-to-end approaches based on Deep Learning architectures have proved to be equally effective, without the use of handcrafted features or any data pre-processing, exploiting word embeddings as the only features.

In order to deal with sequences, Collobert et al. (2011) proposed a Convolutional Neural Network (CNN), trained to maximize the overall sentence-level log-likelihood of tag sequences, which was able to achieve state-of-the-art accuracy on English PoS tagging. More recently, recurrent neural networks (RNN) have been proposed.
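As a concrete illustration of the interpolated trigram estimate given in Section 4, the following sketch computes P(ti | ti-1, ti-2) from n-gram counts. This is only a minimal sketch: the function and parameter names are ours, and the λ weights are arbitrary placeholders, whereas HunPos/Tanl estimate them from data (subject to λ1 + λ2 + λ3 = 1).

```python
from collections import Counter

# Sketch of the interpolated trigram estimate of Section 4.
# The lambda weights are illustrative, not the tagger's actual values.

def interpolated_prob(t, prev1, prev2, unigrams, bigrams, trigrams,
                      lambdas=(0.1, 0.3, 0.6)):
    """P(t | prev1, prev2) as a linear interpolation of ML estimates,
    where prev1 is the previous tag and prev2 the one before it."""
    l1, l2, l3 = lambdas
    total = sum(unigrams.values())
    p1 = unigrams[t] / total
    p2 = bigrams[(prev1, t)] / unigrams[prev1] if unigrams[prev1] else 0.0
    p3 = (trigrams[(prev2, prev1, t)] / bigrams[(prev2, prev1)]
          if bigrams[(prev2, prev1)] else 0.0)
    return l1 * p1 + l2 * p2 + l3 * p3

# Counts collected from a toy tag sequence.
tags = ["DET", "NOUN", "VERB", "DET", "NOUN"]
unigrams = Counter(tags)
bigrams = Counter(zip(tags, tags[1:]))
trigrams = Counter(zip(tags, tags[1:], tags[2:]))
```

Because `Counter` returns 0 for unseen n-grams, the lower-order estimates smooth trigrams that never occur in training, which is the point of the interpolation.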
The word embeddings exploited as features in these systems proved suitable to represent words in well-formed texts like the news articles used in the CoNLL PoS tagging benchmarks.

We conjectured that dealing with the noisy and malformed texts in microposts might require features at a finer level than words, i.e. character-level embeddings. Hence we devised an experiment to explore the effectiveness of combining both character-level and word-level embeddings in PoS tagging of tweets.

We based our experiments on the work by Ma and Hovy (2016), who propose an approach to sequence labeling using a bi-directional long short-term memory (BiLSTM) neural network, a variant of RNN. On top of the BiLSTM, a sequential CRF layer can be used to jointly decode the labels for the whole sentence.

The implementation of the BiLSTM network is done in Lasagne [2], a lightweight library for building and training neural networks in Theano [3].

[2] https://github.com/Lasagne
[3] https://github.com/Theano/Theano

For training the BiLSTM tagger we used word embeddings for tweets, created with the fastText utility [4] (Bojanowski et al., 2016) on a collection of 141 million Italian tweets retrieved over the period from May to September 2016 through the Twitter API. The selection of Italian tweets was achieved by using a query containing a list of the 200 most common Italian words.

[4] https://github.com/facebookresearch/fastText.git

The embeddings were created with dimension 100, using a window of 5 and retaining words with a minimum count of 100, for a total of 245 thousand words.

6 Results

The following table reports the top 9 official scores obtained by participant systems.

  Submission          Accuracy  Correct
  Team1                 0.9319     4435
  Team2                 0.9285     4419
  Team3_UNOFFICIAL      0.9279     4416
  Team4                 0.9270     4412
  Team3                 0.9245     4400
  Team5                 0.9224     4390
  Team5_UNOFFICIAL      0.9184     4371
  UNIPI                 0.9157     4358
  UNIPI_UNOFFICIAL      0.9153     4356

  Table 2. PoSTWITA top official results.

After the submission we performed another experiment with the BiLSTM tagger, increasing the dimension of the word embeddings from 100 to 200, and obtained an accuracy of 92.50% (4402/4759).

To further test the ability of the character-level embeddings to deal completely autonomously with the original writing of tweets, we performed a further experiment in which we supplied the original text of the tweets without normalization. This experiment achieved an accuracy of 91.87% (4372/4759), proving that the RNN character-level approach is indeed capable of learning by itself even unusual tokens, recognizing quite well also emoticons and emojis, without any need for preconceived linguistic knowledge encoded in an ad-hoc rule system.

7 Discussion

While the results of the two approaches, used in the official and unofficial runs, are strikingly close (a difference of only two errors), the two taggers differ significantly in the type of errors they make.

7.1 Error analysis

Table 3 reports a breakdown of the errors over PoS categories, for both systems, in order to appreciate the difference in behaviour. Note that a single PoS mismatch is counted twice, once for each PoS involved. Three cases of misspelled PoS in the gold test were corrected before this analysis.

  PoS        BiLSTM   HMM
  URL             5     2
  EMO            36     6
  DET            32    37
  AUX            27    19
  CONJ            5     2
  NOUN          132   155
  PUNCT           8     5
  MENTION         1     0
  NUM            16    14
  ADP_A           8     7
  ADV            44    51
  VERB_CLIT       4     3
  ADP            26    27
  SCONJ          15    26
  PROPN         136   150
  INTJ           44    34
  VERB          110    83
  X              34    31
  ADJ            67    86
  SYM             3     5
  PRON           42    56
  HASHTAG         1     1
  TOTAL         796   800

  Table 3. Breakdown of errors over PoS types.

As previously mentioned, social media specific tags are not the most difficult problem. To be fair, we noticed that the official BiLSTM run is plagued by a suspiciously high number of errors in identifying EMOs. However, by checking the steps of the experiment, we discovered that this poor performance was due to a mistake in the normalization step.

Confusion between NOUN and PROPN represents the largest source of errors. In the official run there are 66 such errors (35 PROPN tagged as NOUN, 33 NOUN tagged as PROPN), corresponding to nearly 17% of all the errors. The traditional unofficial run does even worse: 19% of its errors are due to this confusion.

Both taggers are weak in dealing with improper use of case (lowercase proper names and all-caps texts), which is very common in Twitter posts. This could be because the training set is still dominated by more regular texts, where case is a strong indication of proper names. In addition, the annotation style chosen for long titles, not fully compliant with UD, makes the task even more difficult. For example, the events "Settimana della moda femminile/Women's fashion week" and "Giornata mondiale vittime dell'amianto/World Day of the victims of asbestos" are annotated as sequences of PROPN in the gold test set, as opposed to using the normal grammatical conventions, as specified in the UD guidelines.

The traditional system is slightly more accurate in predicting the distinction between VERB (main verbs) and AUX (auxiliary and modal verbs): 19 errors against 26.

8 Conclusions

We explored using both a traditional HMM trigram PoS tagger and a Deep Learning PoS tagger that uses both character-level and word-level embeddings, in the analysis of Italian tweets. The latter tagger uses embeddings as its only features, and no lexicon or other linguistic resource. It performs surprisingly well, with an unofficial run that ranks among the top 5. This confirms our conjecture that character-level embeddings are able to cope with the idiosyncrasies and irregular writings of microposts.

Acknowledgments

We gratefully acknowledge the support by the University of Pisa through project PRA and by NVIDIA Corporation through the donation of a Tesla K40 GPU used in the experiments.

References

Giuseppe Attardi and Maria Simi. 2009. Overview of the EVALITA Part-of-Speech Tagging Task. Proceedings of Workshop Evalita 2009, Reggio Emilia.

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2016. Enriching Word Vectors with Subword Information. https://arxiv.org/abs/1607.04606

Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural Language Processing (Almost) from Scratch. Journal of Machine Learning Research 12, 2461-2505.

Péter Halácsy, András Kornai and Csaba Oravecz. 2007. HunPos – an open source trigram tagger. Proceedings of the Demo and Poster Sessions of the 45th Annual Meeting of the ACL, pp. 209-212.

Xuezhe Ma and Eduard Hovy. 2016. End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL 2016), pp. 1064-1074, Berlin, Germany, August 2016.
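To make the combined representation discussed in Section 5 concrete, the toy sketch below builds a token vector by concatenating a word-level part with a character-level part. All dimensions and the mean-pooling of character vectors are illustrative assumptions of ours: the actual tagger uses 100/200-dimensional fastText word vectors, and Ma and Hovy (2016) derive the character-level part with a neural network rather than by averaging.

```python
import random

# Toy sketch of a word-level + character-level token representation.
# Dimensions and pooling are made up for illustration only.
WORD_DIM, CHAR_DIM = 4, 3
random.seed(0)
word_table = {}   # word -> embedding (left empty: every word is OOV here)
char_table = {}   # character -> embedding, created on first use

def char_vector(ch):
    """Return (and cache) a random vector standing in for a learned one."""
    if ch not in char_table:
        char_table[ch] = [random.uniform(-1, 1) for _ in range(CHAR_DIM)]
    return char_table[ch]

def token_representation(token):
    """Word vector (zeros if out-of-vocabulary) + mean of character vectors."""
    word_part = word_table.get(token.lower(), [0.0] * WORD_DIM)
    chars = [char_vector(c) for c in token]
    char_part = [sum(v[i] for v in chars) / len(chars)
                 for i in range(CHAR_DIM)]
    return word_part + char_part
```

Even a token never seen as a word, such as an emoticon, still receives a non-trivial character-level part, which is the intuition behind the robustness to unnormalized tweets observed in the experiments above.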