=Paper=
{{Paper
|id=Vol-1749/paper_015
|storemode=property
|title=Building the state–of–the–art in POS tagging of Italian Tweets
|pdfUrl=https://ceur-ws.org/Vol-1749/paper_015.pdf
|volume=Vol-1749
|authors=Andrea Cimino,Felice Dell'Orletta
|dblpUrl=https://dblp.org/rec/conf/clic-it/CiminoD16
}}
==Building the state-of-the-art in POS tagging of Italian Tweets==
Andrea Cimino and Felice Dell'Orletta
Istituto di Linguistica Computazionale "Antonio Zampolli" (ILC–CNR)
ItaliaNLP Lab - www.italianlp.it
{andrea.cimino, felice.dellorletta}@ilc.cnr.it

Abstract

English. In this paper we describe our approach to the EVALITA 2016 POS tagging for Italian Social Media Texts (PoSTWITA) task. We developed a two-branch bidirectional Long Short Term Memory (LSTM) recurrent neural network, where the first bi-LSTM uses a typical vector representation for the input words, while the second one uses a newly introduced word-vector representation able to encode information about the characters in the words while avoiding the increase in computational cost caused by hierarchical character-based LSTM architectures. The vector representations calculated by the two LSTMs are then merged by the sum operation. Even though participants were allowed to use other annotated resources in their systems, we used only the distributed data set to train our system. When evaluated on the official test set, our system outperformed all the other systems, achieving the highest accuracy score in EVALITA 2016 PoSTWITA, with a tagging accuracy of 93.19%. Further experiments carried out after the official evaluation period allowed us to develop a system able to achieve an even higher accuracy. These experiments showed the central role played by handcrafted features even when machine learning algorithms based on neural networks are used.

1 Description of the system

Our approach to the EVALITA 2016 PoSTWITA task (Bosco et al., 2016) was implemented in a software prototype operating on tokenized sentences which assigns to each token a score expressing its probability of belonging to a given part-of-speech class. The highest score represents the most probable class.

Differently from the previous EVALITA part-of-speech tagging tasks (Tamburini (2007), Attardi and Simi (2009)), in EVALITA 2016 PoSTWITA the participants must tackle the problem of analyzing text with low conformance to common writing practices. For example, capitalization rules may be ignored; excessive punctuation, particularly repeated ellipses and question marks, may be used; or spacing may be irregular (Agichtein et al., 2008). Our development strategy took this issue into account. In particular, we implemented a multiple-input bidirectional Long Short Term Memory recurrent neural network (LSTM) model. We developed a two-branch bidirectional LSTM (bi-LSTM) where the first bi-LSTM uses a typical vector representation of the input words commonly used for different classification tasks, while the second one uses a newly introduced word-vector representation specifically designed to handle the peculiarities of the ill-formed or non-standard texts typical of social media.

To create the input vectors for the two branches we use a combination of different components extracted from three different word embedding lexicons, from a manually created morpho-syntactic lexicon and from handcrafted features specifically defined to improve the accuracy of the system when tested on social media texts.

In this work we used the Keras deep learning framework (Chollet, 2016) to generate the neural network models.

1.1 Lexicons

In order to improve the overall accuracy of our system, we developed three word embedding lexicons (freely available at http://www.italianlp.it/) and we used a manually created morpho-syntactic lexicon.

1.1.1 Word Embedding lexicons

Since the lexical information in tweets can be very sparse, to overcome this problem we built three word embedding lexicons.

For this purpose, we trained two predict models using the word2vec toolkit (http://code.google.com/p/word2vec/) (Mikolov et al., 2013). As recommended in (Mikolov et al., 2013), we used the CBOW model, which learns to predict the word in the middle of a symmetric window based on the sum of the vector representations of the words in the window. For our experiments, we considered a context window of 5 words. These models learn lower-dimensional word embeddings: embeddings are represented by a set of latent (hidden) variables, and each word is a multidimensional vector that represents a specific instantiation of these variables. We built two word embedding lexicons starting from the following corpora (a training sketch is given after the list):

• The first lexicon was built using a tokenized version of the itWaC corpus (http://wacky.sslmit.unibo.it/doku.php?id=corpora). The itWaC corpus is a 2 billion word corpus constructed from the Web, limiting the crawl to the .it domain and using medium-frequency words from the Repubblica corpus and basic Italian vocabulary lists as seeds.

• The second lexicon was built from a tokenized corpus of tweets. This corpus was collected using the Twitter APIs and is made up of 10,700,781 Italian tweets.
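As an illustration, here is a minimal sketch of how such a lexicon can be trained. It uses the gensim reimplementation of word2vec rather than the original C toolkit employed in the paper, and a two-tweet toy corpus as a stand-in for the real one; the hyperparameters (CBOW, a symmetric 5-word window, summed context vectors, 128-dimensional embeddings as reported in Section 1.2.1) follow the values stated in the text.

```python
# Minimal sketch (not the authors' code): training a CBOW word2vec
# lexicon with gensim on a tokenized tweet corpus. The toy sentences
# below stand in for the 10.7M-tweet corpus described above.
from gensim.models import Word2Vec

sentences = [
    ["@user", "che", "bella", "giornata", "!"],
    ["il", "gatto", "dorme", "sul", "divano"],
]

model = Word2Vec(
    sentences,
    vector_size=128,  # 128-dimensional embeddings, as in Section 1.2.1
    window=5,         # symmetric context window of 5 words
    sg=0,             # CBOW architecture
    cbow_mean=0,      # predict from the *sum* of the context vectors
    min_count=1,      # keep every token in this toy example
)
print(model.wv["gatto"].shape)  # -> (128,)
```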
In addition to these two lexicons, we built another word embedding lexicon based on fastText (Bojanowski et al., 2016), a library for efficient learning of word representations and sentence classification. FastText makes it possible to overcome the problem of out-of-vocabulary words which affects the underlying word2vec methodology. Generating out-of-vocabulary word embeddings is a typical issue for morphologically rich languages with large vocabularies and many rare words. FastText overcomes this limitation by representing each word as a bag of character n-grams: a vector representation is associated to each character n-gram, and the word is represented as the sum of these character n-gram representations. To build the lexicon based on fastText, we adopted as learning corpus the same set of tokenized tweets used to build the word2vec-based lexicon.
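A comparable sketch for the fastText lexicon, again using gensim's reimplementation instead of the fastText library itself, shows the key practical difference: thanks to the subword n-grams, a vector can be produced even for out-of-vocabulary forms.

```python
# Minimal sketch (not the authors' code): a fastText-style lexicon
# trained with gensim. Words are represented as bags of character
# n-grams, so unseen or misspelled forms still receive a vector.
from gensim.models import FastText

sentences = [
    ["@user", "che", "bella", "giornata", "!"],
    ["il", "gatto", "dorme", "sul", "divano"],
]

model = FastText(sentences, vector_size=128, window=5, min_count=1)

# "gattoooo" never occurs in the corpus, but its character n-grams
# overlap with "gatto", so a 128-dimensional vector is still returned
print(model.wv["gattoooo"].shape)  # -> (128,)
```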
1.1.2 Morpho-syntactic lexicon

We used a large Italian lexicon of about 1,300,000 forms, developed as part of the SemaWiki project (http://medialab.di.unipi.it/wiki/SemaWiki). The full-form lexicon was generated from a base lexicon of 65,500 lemmas, initially inspired by the Zanichelli dictionary (Zingarelli: Il nuovo Zingarelli minore, 2008), updated over several years and cross-checked with other online dictionaries (Aldo Gabrielli: Il Grande Dizionario di Italiano; Tullio De Mauro: Il Dizionario della lingua italiana). For each form the lexicon contains all the possible parts-of-speech and provides information on morpho-syntactic features, but uses a different tagset (the ISST-TANL tagset, http://www.italianlp.it/docs/ISST-TANL-POStagset.pdf) with respect to the one used for PoSTWITA.

1.2 The POS tagger architecture

The LSTM unit was initially proposed by Hochreiter and Schmidhuber (1997). LSTM units are able to propagate an important feature that came early in the input sequence over a long distance, thus capturing potential long-distance dependencies. This type of neural network was recently tested on Sentiment Analysis tasks (Tang et al., 2015), (Xu et al., 2016), where it has been proven to outperform commonly used learning algorithms in several sentiment analysis tasks (Nakov et al., 2016), showing improvements of 3-4 points. Similarly big improvements have not been obtained in tagging tasks, such as part-of-speech tagging. This is mostly due to the fact that state-of-the-art systems for part-of-speech tagging exploit strongly performing learning algorithms and heavy feature engineering; in addition, a little knowledge of the surrounding context is enough to reach very high tagging performance. On the contrary, LSTM networks perform very well with respect to other learning algorithms when word dependencies are long. Although without a big improvement, POS tagging systems which exploit LSTMs as learning algorithm have been proven to reach state-of-the-art performance both when analyzing text at the character level (Ling et al., 2015) and at the word level (Wang et al., 2016). More specifically, they used a bidirectional LSTM, which allows the network to capture long-range dependencies from both directions of a sentence by constructing bidirectional links in the network (Schuster and Paliwal, 1997). In addition, Plank et al. (2016) have proposed a model which takes into account at the same time both word-level and character-level information, showing very good results for many languages.

As proposed by these systems, we employed a bidirectional LSTM architecture. We implemented a two-branch bidirectional LSTM, but instead of using a character-based branch we introduced another specific word-level branch in order to avoid the computational cost of the hierarchical LSTM introduced by character-based LSTM architectures. This branch encodes information about the characters in each word of a sentence. The vector representations calculated by the two LSTMs are then merged by the sum operation. For what concerns the optimization process, categorical cross-entropy is used as the loss function and the optimization is performed by the rmsprop optimizer (Tieleman and Hinton, 2012). Each bidirectional LSTM branch is configured to have 24 units. In addition, we applied a dropout factor to both the input gates and the recurrent connections in order to prevent overfitting, which is a typical issue of neural networks (Gal and Ghahramani, 2015). As suggested in (Gal and Ghahramani, 2015), we chose a dropout factor value in the optimum range [0.3, 0.5], more specifically 0.35 for each branch.

Figure 1: Diagram of the two-branched bi-LSTM architecture (the Word-based and Bag-of-Character Word-based LSTM branches feed merged vectors into a categorical cross-entropy output).
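Based on the description above, the architecture can be sketched in Keras as follows. This is a reconstruction, not the authors' released code: the sequence length, tagset size and one-hot target encoding are assumptions, while the branch sizes (24 units), dropout (0.35 on inputs and recurrent connections), sum merge, loss and optimizer follow the reported values.

```python
# Reconstruction sketch of the two-branch bi-LSTM (Figure 1); MAX_LEN
# and N_TAGS are assumed values, not taken from the paper.
from keras.layers import LSTM, Bidirectional, Dense, Input, add
from keras.models import Model

MAX_LEN = 50  # assumed maximum tweet length in tokens
N_TAGS = 22   # assumed size of the PoSTWITA tagset

def branch(input_dim):
    """One bidirectional LSTM branch with 24 units and a 0.35 dropout
    factor on both the inputs and the recurrent connections."""
    inp = Input(shape=(MAX_LEN, input_dim))
    out = Bidirectional(
        LSTM(24, return_sequences=True,
             dropout=0.35, recurrent_dropout=0.35))(inp)
    return inp, out

word_in, word_out = branch(979)  # word-level vectors (Section 1.2.1)
char_in, char_out = branch(316)  # bag-of-character vectors (Section 1.2.2)

merged = add([word_out, char_out])                  # merge by element-wise sum
tags = Dense(N_TAGS, activation="softmax")(merged)  # one tag per token

model = Model(inputs=[word_in, char_in], outputs=tags)
model.compile(optimizer="rmsprop", loss="categorical_crossentropy")
model.summary()
```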
1.2.1 Word-based bi-LSTM

In this section, we describe the Word-based bidirectional LSTM branch of the proposed neural network architecture and the word-level information given in input to this layer. Each word is represented by a low-dimensional, continuous and real-valued vector, also known as a word embedding, and all the word vectors are stacked in a word embedding matrix. To train this LSTM branch, each input word in the tweet is represented by a 979-dimensional vector composed of the following components (a sketch of the resulting vector follows the list):

Word2vec word embeddings: the concatenation of the two word embeddings extracted from the two available word2vec word embedding lexicons (128 components for each word embedding, thus resulting in a total of 256 components); for each word embedding an extra component was added in order to handle the "unknown word" case (2 components).

FastText word embeddings: the word embedding extracted from the fastText word embedding lexicon (128 components).

Morpho-syntactic category: the parts-of-speech and the corresponding morpho-syntactic features obtained by exploiting the morpho-syntactic lexicon, resulting in 293 components.

Spell checker: the parts-of-speech and the corresponding morpho-syntactic features obtained by analyzing the current word with a spell checker (pyenchant, http://pythonhosted.org/pyenchant/) and then exploiting the morpho-syntactic lexicon, resulting in 295 components.

Word length: a component representing the length of the analyzed word.

Is URL: a component indicating whether the "http" substring is contained in the analyzed word.

Is uppercase: a component indicating whether the analyzed word is uppercase.

Is capitalized: a component indicating whether the analyzed word is capitalized.

End of sentence: a component indicating whether or not the sentence was totally read.
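To make the layout of this 979-dimensional vector concrete, the following sketch assembles it with the lexicon-derived blocks mocked as zero vectors (the real system fills them from the embedding lexicons, the morpho-syntactic lexicon and pyenchant); only the handcrafted tail components are computed for real, and the concatenation order is an assumption.

```python
# Illustration of the 979-dimensional word vector (Section 1.2.1).
# The first five blocks are placeholders here; their sizes match the
# component counts given in the text.
import numpy as np

def word_vector(word, sentence_read=False):
    w2v_itwac   = np.zeros(129)  # itWaC word2vec + unknown-word flag
    w2v_twitter = np.zeros(129)  # Twitter word2vec + unknown-word flag
    fasttext    = np.zeros(128)  # fastText embedding
    morpho      = np.zeros(293)  # morpho-syntactic lexicon categories
    spell       = np.zeros(295)  # categories of the spell-checked form

    handcrafted = np.array([
        len(word),              # word length
        float("http" in word),  # is URL
        float(word.isupper()),  # is uppercase
        float(word.istitle()),  # is capitalized
        float(sentence_read),   # end of sentence
    ])
    return np.concatenate(
        [w2v_itwac, w2v_twitter, fasttext, morpho, spell, handcrafted])

print(word_vector("Ciao").shape)  # -> (979,)
```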
1.2.2 Bag-of-Character Word-based bi-LSTM

In this section, we describe the Bag-of-Character Word-based bidirectional LSTM branch of the proposed neural network architecture and the word-level information given in input to this layer. Differently from the Word-based LSTM branch, in this branch we did not use pretrained vectors. To train this LSTM branch, each input word in the tweet is represented by a 316-dimensional vector composed of the following components (a sketch follows the list):

Characters: a vector representing the set of characters which compose the current word. Since our considered alphabet is composed of 173 different characters, this results in a 173-dimensional vector.

Lowercased characters: 134 components representing the set of lowercased characters which compose the current word.

Has numbers: a component indicating whether or not the current word contains a number.

Contains not numbers: a component indicating whether or not the current word contains non-number characters.

Contains lowercased: a component indicating whether or not the current word contains lowercase characters.

Contains uppercased: a component indicating whether or not the current word contains uppercase characters.

Contains alphanumeric: a component indicating whether or not the current word contains alphanumeric characters.

Contains not alphanumeric: a component indicating whether or not the current word contains non-alphanumeric characters.

Contains alphabetics: a component indicating whether or not the current word contains alphabetic characters.

Contains not alphabetics: a component indicating whether or not the current word contains non-alphabetic characters.

End of sentence: a component indicating whether the sentence was totally read.
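The bag-of-character vector can be illustrated in the same way. The 173-character alphabet and its 134-character lowercased counterpart are not distributed with the paper, so printable ASCII stands in for them here; the nine binary flags mirror the components listed above.

```python
# Illustration of the bag-of-character word vector (Section 1.2.2),
# with stand-in alphabets (the paper's have 173 and 134 symbols).
import numpy as np

ALPHABET = [chr(c) for c in range(32, 127)]       # stand-in, 95 symbols
LOWERCASE = [c for c in ALPHABET if c.islower()]  # stand-in, 26 symbols

def char_vector(word, sentence_read=False):
    chars = np.array([float(c in word) for c in ALPHABET])
    lower = np.array([float(c in word.lower()) for c in LOWERCASE])
    flags = np.array([
        float(any(c.isdigit() for c in word)),      # has numbers
        float(any(not c.isdigit() for c in word)),  # contains non-numbers
        float(any(c.islower() for c in word)),      # contains lowercased
        float(any(c.isupper() for c in word)),      # contains uppercased
        float(any(c.isalnum() for c in word)),      # contains alphanumeric
        float(any(not c.isalnum() for c in word)),  # contains non-alphanumeric
        float(any(c.isalpha() for c in word)),      # contains alphabetics
        float(any(not c.isalpha() for c in word)),  # contains non-alphabetics
        float(sentence_read),                       # end of sentence
    ])
    return np.concatenate([chars, lower, flags])

# with the real alphabets the result would have 173 + 134 + 9 = 316 components
print(char_vector("Ciao!").shape)  # -> (130,) with the stand-in alphabets
```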
2 Results and Discussion

To develop our system, we created an internal development set of 368 tweets randomly selected from the training set distributed by the task organizers. The first row in Table 1 reports the accuracy achieved by our final system on the internal development set and on the official test set (row Two-branch bi-LSTM).

Configuration                  Devel   Test
Two-branch bi-LSTM             96.55   93.19
Word bi-LSTM                   96.03   92.35
Bag-of-Char. Word bi-LSTM      84.47   80.77
No Morpho-syntactic lexicon    96.48   93.54
No spell checker               96.49   93.31
No word2vec lexicons           93.23   89.87
No fastText lexicon            95.85   92.43
No feature engineering         96.39   93.06

Table 1: Tagging accuracy (in percentage) of the different learning models on our development set and the official test set.

We tested different configurations of our system in order to evaluate the contribution to the tagging accuracy of: i) each branch in the proposed architecture, ii) the different word embedding and morpho-syntactic lexicons and iii) the handcrafted features. We carried out different experiments that reflect the questions we wanted to answer, more specifically:

• (a) what are the contributions of the Word-based bi-LSTM and of the Bag-of-Character Word-based bi-LSTM?

• (b) what is the contribution of the morpho-syntactic lexicon?

• (c) what is the contribution of the spell checker?

• (d) what is the contribution of fastText with respect to the word2vec word embedding lexicons?

In order to answer question (a), we first ran the Word-based bi-LSTM excluding the Bag-of-Character Word-based bi-LSTM branch; then we excluded the Word-based bi-LSTM to verify the contribution of the Bag-of-Character Word-based bi-LSTM. The results of these experiments are reported in the Word bi-LSTM and Bag-of-Char. Word bi-LSTM rows of Table 1. The Word-based bi-LSTM is clearly the best performer with respect to the Bag-of-Character one, but remarkably our proposed two-branch architecture shows an improvement of about 0.5 points on the development set with respect to the best single bi-LSTM. The same behaviour is shown on the test set, where the combined system achieves an improvement of 0.84 points with respect to the single Word-based bi-LSTM.

In order to answer question (b), we excluded from the input vectors of the Word-based bi-LSTM branch the morpho-syntactic category components extracted from the morpho-syntactic lexicon. The No Morpho-syntactic lexicon row reports the results and shows that this information gives a negligible improvement on the development set and, unexpectedly, a slight drop on the test set.

For what concerns question (c), we excluded the morpho-syntactic category components of the word obtained using the spell checker. The results are reported in the No spell checker row. Similarly to what happened in experiment (b), such information does not contribute to increasing the tagging performance.

In order to compare the contributions of the fastText and word2vec lexicons (question (d)), we considered two different system configurations: one removing the two word2vec lexicons (No word2vec lexicons row) and one removing the fastText and itWaC word2vec lexicons (No fastText lexicon row). In this second configuration we removed also the itWaC word2vec lexicon in order to compare fastText and word2vec using the same learning corpus (the Twitter corpus described in Section 1.1.1). In both configurations we excluded the other Word-based LSTM components, while we left all the components of the Bag-of-Character Word-based LSTM. The results show that word2vec seems to be a better choice with respect to fastText, both on the development and the test set. This is in contrast with what we would have expected, considering that fastText learns the word embedding representation using subword information that should be particularly useful for the analysis of non-standard texts such as social media ones.

2.1 Single bi-LSTM and Handcrafted features

After the submission of the final system results, we devised two further experiments. The first one was devoted to testing the tagging performance of a single word-based bi-LSTM architecture with respect to the presented two-branch bi-LSTM. The second experiment was aimed at studying the effect of handcrafted features combined with the learned ones. To this aim, we developed a part-of-speech tagger based on a single word-based bi-LSTM, where each input word vector is the concatenation of the two input word representations of the bi-LSTMs presented in Section 1.2.1 and Section 1.2.2.

Table 2 reports the results of these experiments. As shown in the Single bi-LSTM row, the use of the single architecture instead of the two-branch one does not hurt tagging results; actually, the single bi-LSTM slightly outperforms the two-branch architecture when tested on the test set (+0.48%).

In order to evaluate the effect of handcrafted features, we conducted a last experiment where we removed all the components from the input vectors of the single word-based bi-LSTM with the exception of the word2vec and fastText word embeddings. The No handcrafted features row shows the relevance of the handcrafted features, which yield an improvement of 1.34% and 1.68% on the development and test sets respectively. These results show the important role of feature engineering even when neural network learning algorithms are used.

Configuration              Devel   Test
Single bi-LSTM             96.39   93.67
No handcrafted features    95.22   91.99

Table 2: Tagging accuracy of the single word-based bi-LSTM on our development set and the official test set.
3 Conclusion

In this paper we reported the results of our participation in the EVALITA 2016 POS tagging for Italian Social Media Texts (PoSTWITA) task. By resorting to a two-branch bidirectional LSTM, word embeddings, morpho-syntactic lexicons and handcrafted features, we achieved the best score. In particular, we showed the relevance of handcrafted features, which allowed an improvement of more than one percentage point in tagging accuracy on both the development and test sets when combined with learned features such as word embedding lexicons. As a future research direction, we will test the contribution of a pure character-based LSTM with respect to character handcrafted features.

References

Eugene Agichtein, Carlos Castillo, Debora Donato, Aristides Gionis and Gilad Mishne. 2008. Finding High-quality Content in Social Media. In Proceedings of the 2008 International Conference on Web Search and Data Mining. New York, USA.

Giuseppe Attardi and Maria Simi. 2009. Overview of the EVALITA 2009 Part-of-Speech Tagging Task. In Proceedings of Evalita '09, Evaluation of NLP and Speech Tools for Italian. December, Reggio Emilia, Italy.

Piotr Bojanowski, Edouard Grave, Armand Joulin and Tomas Mikolov. 2016. Enriching Word Vectors with Subword Information. arXiv preprint arXiv:1607.04606.

Cristina Bosco, Fabio Tamburini, Andrea Bolioli and Alessandro Mazzei. 2016. Overview of the EVALITA 2016 Part Of Speech on TWitter for ITAlian Task. In Proceedings of the Third Italian Conference on Computational Linguistics (CLiC-it 2016) & Fifth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2016). Associazione Italiana di Linguistica Computazionale (AILC).

François Chollet. 2016. Keras. Software available at https://github.com/fchollet/keras/tree/master/keras.

Cícero Nogueira dos Santos and Bianca Zadrozny. 2014. Learning Character-level Representations for Part-of-Speech Tagging. In Proceedings of the 31st International Conference on Machine Learning (ICML 2014).

Yarin Gal and Zoubin Ghahramani. 2015. A theoretically grounded application of dropout in recurrent neural networks. arXiv preprint arXiv:1512.05287.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation.

Wang Ling, Chris Dyer, Alan W. Black, Isabel Trancoso, Ramon Fermandez, Silvio Amir, Luis Marujo and Tiago Luís. 2015. Finding Function in Form: Compositional Character Models for Open Vocabulary Word Representation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, 1520–1530, Lisbon, Portugal. ACL.
Tomas Mikolov, Kai Chen, Greg Corrado and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

Preslav Nakov, Alan Ritter, Sara Rosenthal, Fabrizio Sebastiani and Veselin Stoyanov. 2016. SemEval-2016 Task 4: Sentiment Analysis in Twitter. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016).

Barbara Plank, Anders Søgaard and Yoav Goldberg. 2016. Multilingual Part-of-Speech Tagging with Bidirectional Long Short-Term Memory Models and Auxiliary Loss. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). August, Berlin, Germany.

Mike Schuster and Kuldip K. Paliwal. 1997. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing 45(11):2673–2681.

Fabio Tamburini. 2007. Evalita 2007: The Part-of-Speech Tagging Task. In Proceedings of Evalita '07, Evaluation of NLP and Speech Tools for Italian. September, Rome, Italy.

Duyu Tang, Bing Qin and Ting Liu. 2015. Document modeling with gated recurrent neural network for sentiment classification. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, 1422–1432, Lisbon, Portugal.

Tijmen Tieleman and Geoffrey Hinton. 2012. Lecture 6.5-RmsProp: Divide the gradient by a running average of its recent magnitude. In COURSERA: Neural Networks for Machine Learning.

Peilu Wang, Yao Qian, Frank K. Soong, Lei He and Hai Zhao. 2016. Learning Distributed Word Representations for Bidirectional LSTM Recurrent Neural Network. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 527–533, San Diego, CA, USA. ACL.

XingYi Xu, HuiZhi Liang and Timothy Baldwin. 2016. UNIMELB at SemEval-2016 Tasks 4A and 4B: An Ensemble of Neural Networks and a Word2Vec Based Model for Sentiment Classification. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016).